Ceres
Ceres — Semantic Search Engine for Open Data Portals

Install the skill:

```bash
npx machina-cli add skill AndreaBozzo/Ceres-Claude-Skill/ceres --openclaw
```
Ceres harvests metadata from CKAN open data portals and indexes it with vector embeddings, enabling semantic search across fragmented data sources.
Repository: https://github.com/AndreaBozzo/Ceres | License: Apache-2.0 | Rust edition: 2024 | MSRV: 1.87+
Pipeline
```
Portal URL → PortalClient (fetch metadata) → DeltaDetector (content_hash)
           → EmbeddingProvider (vector)    → DatasetStore (upsert with pgvector)
```
Each stage is a trait, so every component can be swapped or mocked independently.
Crate Map
| Crate | Purpose | Key Exports |
|---|---|---|
| ceres-core | Business logic, traits, services | HarvestService, SearchService, ExportService, WorkerService, CircuitBreaker, traits |
| ceres-client | CKAN API client, Gemini/OpenAI clients | CkanClient, GeminiClient, OpenAIClient, PortalClientFactoryEnum, EmbeddingProviderEnum |
| ceres-db | PostgreSQL + pgvector repository | DatasetRepository, HarvestJobRepository |
| ceres-server | Axum REST API with Swagger UI | Routes, DTOs, bearer auth, OpenAPI/Swagger |
| ceres-cli | Command-line interface | harvest, search, export, stats subcommands |
Core Traits (ceres-core::traits)
```rust
pub trait EmbeddingProvider: Send + Sync + Clone {
    fn name(&self) -> &'static str;
    fn dimension(&self) -> usize;
    fn generate(&self, text: &str) -> impl Future<Output = Result<Vec<f32>, AppError>> + Send;
    fn max_batch_size(&self) -> usize { 1 }
    fn generate_batch(&self, texts: &[String]) -> impl Future<Output = Result<Vec<Vec<f32>>, AppError>> + Send;
}

pub trait PortalClient: Send + Sync + Clone {
    type PortalData: Send;
    fn portal_type(&self) -> &'static str;
    fn base_url(&self) -> &str;
    fn list_dataset_ids(&self) -> impl Future<Output = Result<Vec<String>, AppError>> + Send;
    fn get_dataset(&self, id: &str) -> impl Future<Output = Result<Self::PortalData, AppError>> + Send;
    fn into_new_dataset(data: Self::PortalData, portal_url: &str, url_template: Option<&str>, language: &str) -> NewDataset;
    fn search_modified_since(&self, since: DateTime<Utc>) -> impl Future<Output = Result<Vec<Self::PortalData>, AppError>> + Send;
    fn search_all_datasets(&self) -> impl Future<Output = Result<Vec<Self::PortalData>, AppError>> + Send;
}

pub trait PortalClientFactory: Send + Sync + Clone {
    type Client: PortalClient;
    fn create(&self, portal_url: &str, portal_type: PortalType) -> Result<Self::Client, AppError>;
}

pub trait DatasetStore: Send + Sync + Clone {
    fn get_by_id(&self, id: Uuid) -> impl Future<Output = Result<Option<Dataset>, AppError>> + Send;
    fn get_hashes_for_portal(&self, portal_url: &str) -> impl Future<Output = Result<HashMap<String, Option<String>>, AppError>> + Send;
    fn upsert(&self, dataset: &NewDataset) -> impl Future<Output = Result<Uuid, AppError>> + Send;
    fn batch_upsert(&self, datasets: &[NewDataset]) -> impl Future<Output = Result<Vec<Uuid>, AppError>> + Send;
    fn search(&self, query_vector: Vec<f32>, limit: usize) -> impl Future<Output = Result<Vec<SearchResult>, AppError>> + Send;
    fn list_stream<'a>(&'a self, portal_filter: Option<&'a str>, limit: Option<usize>) -> BoxStream<'a, Result<Dataset, AppError>>;
    fn get_last_sync_time(&self, portal_url: &str) -> impl Future<Output = Result<Option<DateTime<Utc>>, AppError>> + Send;
    fn record_sync_status(&self, portal_url: &str, sync_time: DateTime<Utc>, sync_mode: &str, sync_status: &str, datasets_synced: i32) -> impl Future<Output = Result<(), AppError>> + Send;
    fn health_check(&self) -> impl Future<Output = Result<(), AppError>> + Send;
    // + update_timestamp_only, batch_update_timestamps, get_duplicate_titles
}
```
Key Types
| Type | Module | Purpose |
|---|---|---|
| Dataset | ceres_core::models | Complete dataset row (id, original_id, source_portal, url, title, description, embedding, metadata, timestamps, content_hash) |
| NewDataset | ceres_core::models | Insert/update DTO. Has compute_content_hash() for delta detection |
| SearchResult | ceres_core::models | Dataset + similarity_score (0.0-1.0) |
| DatabaseStats | ceres_core::models | total_datasets, datasets_with_embeddings, total_portals, last_update |
| HarvestJob | ceres_core::job | Queued harvest job with status, retry info, portal config |
| JobStatus | ceres_core::job | Enum: Pending, Running, Completed, Failed, Cancelled |
| SyncStats | ceres_core::sync | created, updated, unchanged, failed, skipped counts |
| SyncOutcome | ceres_core::sync | Per-dataset outcome: Created, Updated, Unchanged, Failed, Skipped |
| BatchHarvestSummary | ceres_core::sync | Aggregated results from batch harvesting multiple portals |
| PortalEntry | ceres_core::config | Portal config: name, url, type, enabled, url_template, language |
| AppError | ceres_core::error | Error enum with is_retryable() and should_trip_circuit() |
| CircuitBreaker | ceres_core::circuit_breaker | Closed -> Open -> HalfOpen state machine |
Quick Start
```bash
# Install
cargo install ceres-search

# Start PostgreSQL + pgvector
docker compose up db -d

# Configure
cp .env.example .env  # Edit with your Gemini/OpenAI API key

# Run migrations
make migrate

# Harvest a portal
ceres harvest https://dati.comune.milano.it

# Harvest all configured portals
ceres harvest

# Search
ceres search "trasporto pubblico" --limit 5

# Export
ceres export --format jsonl > datasets.jsonl

# Stats
ceres stats
```
Reference Guides
| Topic | File | When to Read |
|---|---|---|
| Architecture deep-dive | references/architecture.md | Understanding crate graph, services, error handling, database schema |
| CLI & REST API | references/cli-and-server.md | Running CLI commands, calling API endpoints, env vars, deployment |
| Harvesting system | references/harvesting.md | Two-tier optimization, delta detection, streaming, circuit breaker |
| Extending Ceres | references/extending.md | Implementing custom EmbeddingProvider, PortalClient, or DatasetStore |
| Contributing | references/contributing.md | Dev setup, testing, CI, code style |
Version Notes
- Current version: 0.3.0
- crates.io package: ceres-search
- Supports Gemini (768d, gemini-embedding-001) and OpenAI (1536d/3072d, text-embedding-3-small/large) embeddings
- 25+ pre-configured CKAN portals (354k+ datasets)
- HuggingFace dataset: AndreaBozzo/ceres-open-data-index
Source
git clone https://github.com/AndreaBozzo/Ceres-Claude-Skill

Skill definition: https://github.com/AndreaBozzo/Ceres-Claude-Skill/blob/main/ceres/SKILL.md

Overview
Ceres harvests CKAN metadata from open data portals and indexes it with vector embeddings (pgvector) to enable semantic search across portals. It ships a CLI (harvest, search, export, stats), a REST API, and a portals.toml config for multiple portals. Its architecture is trait-based, allowing components like PortalClient, EmbeddingProvider, and DatasetStore to be swapped or extended.
How This Skill Works
Core pipeline: PortalClient fetches dataset metadata, DeltaDetector computes content_hash, EmbeddingProvider generates vectors, and DatasetStore upserts with pgvector. Because every stage is a trait, you can swap components (e.g., different PortalClient or embedding provider) without touching business logic.
When to Use It
- Index CKAN portals to enable semantic search across datasets.
- Perform cross-portal discovery with natural-language queries via the CLI search.
- Keep the index fresh by delta-detecting changes and batch upserts.
- Extend the system by implementing custom EmbeddingProvider or PortalClient traits.
- Expose search results through the REST API and use export for downstream analytics.
Quick Start
- Step 1: Install Ceres and configure a portals.toml with at least one portal.
- Step 2: Run ceres harvest to ingest metadata and index embeddings.
- Step 3: Run ceres search "your query" to verify semantic results.
Best Practices
- Start with portals.toml configured for a single portal to validate the workflow.
- Use batch embeddings (generate_batch) and batch_upsert for large ingests to improve throughput.
- Monitor get_last_sync_time and record_sync_status to track portal freshness.
- Make sure the pgvector column dimension matches the provider's embedding dimension (768 for Gemini, 1536/3072 for OpenAI).
- Experiment with Gemini or OpenAI providers and compare latency and embedding quality.
Example Use Cases
- A city open data portal indexing datasets across departments to enable semantic search.
- A university CKAN portal using Gemini for embeddings and exposing a REST API for frontend apps.
- A multi-portal catalog configured in portals.toml with a unified semantic search API.
- Scheduled harvests with delta detection to keep the index up to date and generate stats.
- Using the export command to push search results to downstream data analytics pipelines.