
Ceres

npx machina-cli add skill AndreaBozzo/Ceres-Claude-Skill/ceres --openclaw

Ceres — Semantic Search Engine for Open Data Portals

Ceres harvests dataset metadata from CKAN open data portals and indexes it with vector embeddings, enabling semantic search across fragmented data sources.

Repository: https://github.com/AndreaBozzo/Ceres | License: Apache-2.0 | Rust edition: 2024 | MSRV: 1.87+

Pipeline

Portal URL → PortalClient (fetch metadata) → DeltaDetector (content_hash)
  → EmbeddingProvider (vector) → DatasetStore (upsert with pgvector)

Each stage is a trait, so every component can be swapped or mocked independently.
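That swappability can be illustrated with a deliberately simplified, synchronous sketch. The real Ceres traits are async and richer (see Core Traits below); `MockPortal`, `MockEmbedder`, and `MemoryStore` here are hypothetical stand-ins, not types shipped by Ceres:

```rust
// Simplified, synchronous stand-ins for Ceres's async pipeline traits,
// just to show how the stages compose generically.
trait PortalClient {
    fn list_dataset_ids(&self) -> Vec<String>;
    fn get_dataset(&self, id: &str) -> String; // raw metadata as text
}

trait EmbeddingProvider {
    fn generate(&self, text: &str) -> Vec<f32>;
}

trait DatasetStore {
    fn upsert(&mut self, id: &str, embedding: Vec<f32>);
}

// The harvest loop is generic over all three stages, so any
// implementation (real client, mock, alternative provider) drops in
// without touching this function.
fn harvest<P: PortalClient, E: EmbeddingProvider, S: DatasetStore>(
    portal: &P,
    embedder: &E,
    store: &mut S,
) -> usize {
    let ids = portal.list_dataset_ids();
    for id in &ids {
        let metadata = portal.get_dataset(id);
        let embedding = embedder.generate(&metadata);
        store.upsert(id, embedding);
    }
    ids.len()
}

// Hypothetical mocks standing in for CkanClient, an embedding client,
// and the pgvector-backed repository.
struct MockPortal;
impl PortalClient for MockPortal {
    fn list_dataset_ids(&self) -> Vec<String> {
        vec!["ds-1".into(), "ds-2".into()]
    }
    fn get_dataset(&self, id: &str) -> String {
        format!("metadata for {id}")
    }
}

struct MockEmbedder;
impl EmbeddingProvider for MockEmbedder {
    fn generate(&self, text: &str) -> Vec<f32> {
        vec![text.len() as f32; 4] // fixed 4-dim "embedding"
    }
}

#[derive(Default)]
struct MemoryStore {
    rows: Vec<(String, Vec<f32>)>,
}
impl DatasetStore for MemoryStore {
    fn upsert(&mut self, id: &str, embedding: Vec<f32>) {
        self.rows.push((id.to_string(), embedding));
    }
}
```

Swapping the embedding provider or the store is then a matter of passing a different implementation to `harvest`; the business logic never changes.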

Crate Map

| Crate | Purpose | Key Exports |
| --- | --- | --- |
| ceres-core | Business logic, traits, services | HarvestService, SearchService, ExportService, WorkerService, CircuitBreaker, traits |
| ceres-client | CKAN API client, Gemini/OpenAI clients | CkanClient, GeminiClient, OpenAIClient, PortalClientFactoryEnum, EmbeddingProviderEnum |
| ceres-db | PostgreSQL + pgvector repository | DatasetRepository, HarvestJobRepository |
| ceres-server | Axum REST API with Swagger UI | Routes, DTOs, bearer auth, OpenAPI/Swagger |
| ceres-cli | Command-line interface | harvest, search, export, stats subcommands |

Core Traits (ceres-core::traits)

pub trait EmbeddingProvider: Send + Sync + Clone {
    fn name(&self) -> &'static str;
    fn dimension(&self) -> usize;
    fn generate(&self, text: &str) -> impl Future<Output = Result<Vec<f32>, AppError>> + Send;
    fn max_batch_size(&self) -> usize { 1 }
    fn generate_batch(&self, texts: &[String]) -> impl Future<Output = Result<Vec<Vec<f32>>, AppError>> + Send;
}

pub trait PortalClient: Send + Sync + Clone {
    type PortalData: Send;
    fn portal_type(&self) -> &'static str;
    fn base_url(&self) -> &str;
    fn list_dataset_ids(&self) -> impl Future<Output = Result<Vec<String>, AppError>> + Send;
    fn get_dataset(&self, id: &str) -> impl Future<Output = Result<Self::PortalData, AppError>> + Send;
    fn into_new_dataset(data: Self::PortalData, portal_url: &str, url_template: Option<&str>, language: &str) -> NewDataset;
    fn search_modified_since(&self, since: DateTime<Utc>) -> impl Future<Output = Result<Vec<Self::PortalData>, AppError>> + Send;
    fn search_all_datasets(&self) -> impl Future<Output = Result<Vec<Self::PortalData>, AppError>> + Send;
}

pub trait PortalClientFactory: Send + Sync + Clone {
    type Client: PortalClient;
    fn create(&self, portal_url: &str, portal_type: PortalType) -> Result<Self::Client, AppError>;
}

pub trait DatasetStore: Send + Sync + Clone {
    fn get_by_id(&self, id: Uuid) -> impl Future<Output = Result<Option<Dataset>, AppError>> + Send;
    fn get_hashes_for_portal(&self, portal_url: &str) -> impl Future<Output = Result<HashMap<String, Option<String>>, AppError>> + Send;
    fn upsert(&self, dataset: &NewDataset) -> impl Future<Output = Result<Uuid, AppError>> + Send;
    fn batch_upsert(&self, datasets: &[NewDataset]) -> impl Future<Output = Result<Vec<Uuid>, AppError>> + Send;
    fn search(&self, query_vector: Vec<f32>, limit: usize) -> impl Future<Output = Result<Vec<SearchResult>, AppError>> + Send;
    fn list_stream<'a>(&'a self, portal_filter: Option<&'a str>, limit: Option<usize>) -> BoxStream<'a, Result<Dataset, AppError>>;
    fn get_last_sync_time(&self, portal_url: &str) -> impl Future<Output = Result<Option<DateTime<Utc>>, AppError>> + Send;
    fn record_sync_status(&self, portal_url: &str, sync_time: DateTime<Utc>, sync_mode: &str, sync_status: &str, datasets_synced: i32) -> impl Future<Output = Result<(), AppError>> + Send;
    fn health_check(&self) -> impl Future<Output = Result<(), AppError>> + Send;
    // + update_timestamp_only, batch_update_timestamps, get_duplicate_titles
}

Key Types

| Type | Module | Purpose |
| --- | --- | --- |
| Dataset | ceres_core::models | Complete dataset row (id, original_id, source_portal, url, title, description, embedding, metadata, timestamps, content_hash) |
| NewDataset | ceres_core::models | Insert/update DTO; has compute_content_hash() for delta detection |
| SearchResult | ceres_core::models | Dataset + similarity_score (0.0-1.0) |
| DatabaseStats | ceres_core::models | total_datasets, datasets_with_embeddings, total_portals, last_update |
| HarvestJob | ceres_core::job | Queued harvest job with status, retry info, portal config |
| JobStatus | ceres_core::job | Enum: Pending, Running, Completed, Failed, Cancelled |
| SyncStats | ceres_core::sync | created, updated, unchanged, failed, skipped counts |
| SyncOutcome | ceres_core::sync | Per-dataset outcome: Created, Updated, Unchanged, Failed, Skipped |
| BatchHarvestSummary | ceres_core::sync | Aggregated results from batch harvesting multiple portals |
| PortalEntry | ceres_core::config | Portal config: name, url, type, enabled, url_template, language |
| AppError | ceres_core::error | Error enum with is_retryable() and should_trip_circuit() |
| CircuitBreaker | ceres_core::circuit_breaker | Closed -> Open -> HalfOpen state machine |
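The CircuitBreaker entry describes a Closed -> Open -> HalfOpen state machine. A minimal sketch of that pattern follows; the thresholds, field names, and timer-free half-open transition are illustrative, not Ceres's actual implementation:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum CircuitState { Closed, Open, HalfOpen }

// Minimal circuit breaker: trips Open after N consecutive failures,
// lets a probe request through in HalfOpen, resets on success.
struct CircuitBreaker {
    state: CircuitState,
    consecutive_failures: u32,
    failure_threshold: u32,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32) -> Self {
        Self {
            state: CircuitState::Closed,
            consecutive_failures: 0,
            failure_threshold,
        }
    }

    /// Returns false while the circuit is Open (requests are shed).
    fn allow_request(&self) -> bool {
        self.state != CircuitState::Open
    }

    fn record_success(&mut self) {
        // Any success (including a HalfOpen probe) resets the breaker.
        self.consecutive_failures = 0;
        self.state = CircuitState::Closed;
    }

    fn record_failure(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.failure_threshold {
            self.state = CircuitState::Open;
        }
    }

    /// A real breaker moves Open -> HalfOpen after a cooldown timer;
    /// here the transition is triggered manually to keep the sketch
    /// timer-free.
    fn half_open(&mut self) {
        if self.state == CircuitState::Open {
            self.state = CircuitState::HalfOpen;
        }
    }
}
```

Pairing this with AppError's should_trip_circuit() is what lets a harvester distinguish retryable portal hiccups from failures that should shed load.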

Quick Start

# Install
cargo install ceres-search

# Start PostgreSQL + pgvector
docker compose up db -d

# Configure
cp .env.example .env  # Edit with your Gemini/OpenAI API key

# Run migrations
make migrate

# Harvest a portal
ceres harvest https://dati.comune.milano.it

# Harvest all configured portals
ceres harvest

# Search
ceres search "trasporto pubblico" --limit 5

# Export
ceres export --format jsonl > datasets.jsonl

# Stats
ceres stats

Reference Guides

| Topic | File | When to Read |
| --- | --- | --- |
| Architecture deep-dive | references/architecture.md | Understanding crate graph, services, error handling, database schema |
| CLI & REST API | references/cli-and-server.md | Running CLI commands, calling API endpoints, env vars, deployment |
| Harvesting system | references/harvesting.md | Two-tier optimization, delta detection, streaming, circuit breaker |
| Extending Ceres | references/extending.md | Implementing custom EmbeddingProvider, PortalClient, or DatasetStore |
| Contributing | references/contributing.md | Dev setup, testing, CI, code style |

Version Notes

  • Current version: 0.3.0
  • crates.io package: ceres-search
  • Supports Gemini (768d, gemini-embedding-001) and OpenAI (1536d/3072d, text-embedding-3-small/large) embeddings
  • 25+ pre-configured CKAN portals (354k+ datasets)
  • HuggingFace dataset: AndreaBozzo/ceres-open-data-index

Source

git clone https://github.com/AndreaBozzo/Ceres-Claude-Skill

Overview

Ceres harvests CKAN metadata from open data portals and indexes it with vector embeddings (pgvector) to enable semantic search across portals. It ships a CLI (harvest, search, export, stats), a REST API, and a portals.toml config for multiple portals. Its architecture is trait-based, allowing components like PortalClient, EmbeddingProvider, and DatasetStore to be swapped or extended.

How This Skill Works

Core pipeline: PortalClient fetches dataset metadata, DeltaDetector computes content_hash, EmbeddingProvider generates vectors, and DatasetStore upserts with pgvector. Because every stage is a trait, you can swap components (e.g., different PortalClient or embedding provider) without touching business logic.
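The delta-detection step can be sketched as a stable hash over the fields that matter, compared against the hash stored from the previous harvest. Ceres's real compute_content_hash may hash different fields with a different algorithm; the struct fields and helper below are assumptions for illustration:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative subset of NewDataset's fields.
struct NewDataset {
    title: String,
    description: String,
    url: String,
}

impl NewDataset {
    // Sketch of content hashing for delta detection. A production
    // implementation should use a hash that is stable across processes
    // and compiler versions (e.g. SHA-256); DefaultHasher is stdlib-only
    // but NOT stable across runs, so it only works within one process.
    fn compute_content_hash(&self) -> String {
        let mut h = DefaultHasher::new();
        self.title.hash(&mut h);
        self.description.hash(&mut h);
        self.url.hash(&mut h);
        format!("{:016x}", h.finish())
    }
}

/// Skip re-embedding when the stored hash matches the fresh one;
/// a dataset with no stored hash has never been harvested.
fn needs_update(stored_hash: Option<&str>, incoming: &NewDataset) -> bool {
    match stored_hash {
        Some(h) => h != incoming.compute_content_hash(),
        None => true,
    }
}
```

This is why an unchanged dataset costs only a hash comparison: the expensive embedding call is skipped entirely unless the content actually changed.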

When to Use It

  • Index CKAN portals to enable semantic search across datasets.
  • Perform cross-portal discovery with natural-language queries via the CLI search.
  • Keep the index fresh by delta-detecting changes and batch upserts.
  • Extend the system by implementing custom EmbeddingProvider or PortalClient traits.
  • Expose search results through the REST API and use export for downstream analytics.

Quick Start

  1. Install Ceres and configure a portals.toml with at least one portal.
  2. Run ceres harvest to ingest metadata and index embeddings.
  3. Run ceres search "your query" to verify semantic results.
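A portals.toml entry might look like the sketch below. The key names mirror the PortalEntry fields (name, url, type, enabled, url_template, language), but the exact file layout may differ in your Ceres version, so treat this as a starting point rather than a schema:

```toml
# Hypothetical portals.toml entry; verify key names against your
# Ceres version's PortalEntry config before relying on them.
[[portals]]
name = "Comune di Milano"
url = "https://dati.comune.milano.it"
type = "ckan"
enabled = true
language = "it"
```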

Best Practices

  • Start with portals.toml configured for a single portal to validate the workflow.
  • Use batch embeddings (generate_batch) and batch_upsert for large ingests to improve throughput.
  • Use get_last_sync_time and record_sync_status to track each portal's freshness.
  • Ensure the pgvector column dimension matches your provider's embedding dimension (768 for Gemini, 1536/3072 for OpenAI).
  • Experiment with Gemini or OpenAI providers and compare latency and embedding quality.
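The batching advice can be sketched as a chunking loop over max_batch_size, so no single request exceeds the provider's limit. The trait below is a simplified synchronous stand-in (the real generate_batch is async), and FixedDim is a hypothetical mock:

```rust
// Simplified synchronous stand-in for EmbeddingProvider's batch API.
trait BatchEmbedder {
    fn max_batch_size(&self) -> usize;
    fn generate_batch(&self, texts: &[String]) -> Vec<Vec<f32>>;
}

/// Embed any number of texts by chunking to the provider's batch limit,
/// so one oversized request never hits the API.
fn embed_all<E: BatchEmbedder>(provider: &E, texts: &[String]) -> Vec<Vec<f32>> {
    let mut out = Vec::with_capacity(texts.len());
    // .max(1) guards against a provider reporting a zero batch size,
    // which would make chunks() panic.
    for chunk in texts.chunks(provider.max_batch_size().max(1)) {
        out.extend(provider.generate_batch(chunk));
    }
    out
}

// Hypothetical mock: batch limit of 2, 3-dim embeddings from text length.
struct FixedDim;
impl BatchEmbedder for FixedDim {
    fn max_batch_size(&self) -> usize { 2 }
    fn generate_batch(&self, texts: &[String]) -> Vec<Vec<f32>> {
        texts.iter().map(|t| vec![t.len() as f32; 3]).collect()
    }
}
```

The same chunking idea applies on the write side: accumulate NewDataset values and flush them through batch_upsert instead of one upsert per row.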

Example Use Cases

  • A city open data portal indexing datasets across departments to enable semantic search.
  • A university CKAN portal using Gemini for embeddings and exposing a REST API for frontend apps.
  • A multi-portal catalog configured in portals.toml with a unified semantic search API.
  • Scheduled harvests with delta detection to keep the index up to date and generate stats.
  • Using the export command to push search results to downstream data analytics pipelines.
