Ares — LLM-Powered Web Scraper
Ares is a Rust library, CLI, and HTTP server that extracts structured data from websites using LLMs and JSON Schemas.
Repository: https://github.com/AndreaBozzo/Ares
License: Apache-2.0 | Rust edition: 2024 | MSRV: 1.88+
Pipeline
URL → Fetcher (HTML) → Cleaner (Markdown) → Extractor (LLM + JSON Schema) → Hash → Compare → Store
Each stage is a trait, so every component can be swapped or mocked independently.
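The Hash → Compare step can be sketched with the standard library's hasher. Ares's actual hash algorithm and field names are not documented here, so this is an illustration of the change-detection idea, not the library's implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash the extracted content so a later scrape can be compared cheaply.
// DefaultHasher is illustrative; Ares may use a different algorithm.
fn content_hash(data: &str) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Compare against the previously stored hash to set the `changed` flag.
fn has_changed(new_data: &str, previous_hash: Option<u64>) -> bool {
    match previous_hash {
        Some(prev) => content_hash(new_data) != prev,
        None => true, // first scrape of a URL always counts as changed
    }
}
```

A worker can use this flag to skip storing (or re-processing) unchanged pages.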
Crate Map
| Crate | Purpose | Key Exports |
|---|---|---|
| ares-core | Business logic, traits, pipeline | ScrapeService, WorkerService, CircuitBreaker, ThrottledFetcher, traits |
| ares-client | HTTP/browser fetchers, cleaner, LLM client | ReqwestFetcher, BrowserFetcher, HtmdCleaner, OpenAiExtractor |
| ares-db | PostgreSQL persistence | Database, ExtractionRepository, ScrapeJobRepository |
| ares-api | Axum REST API | Routes, DTOs, bearer auth, OpenAPI/Swagger |
| ares-cli | Command-line interface | scrape, history, job, worker subcommands |
Core Traits (ares-core::traits)
```rust
pub trait Fetcher: Send + Sync + Clone {
    fn fetch(&self, url: &str) -> impl Future<Output = Result<String, AppError>> + Send;
}

pub trait Cleaner: Send + Sync + Clone {
    fn clean(&self, html: &str) -> Result<String, AppError>;
}

pub trait Extractor: Send + Sync + Clone {
    fn extract(&self, content: &str, schema: &serde_json::Value)
        -> impl Future<Output = Result<serde_json::Value, AppError>> + Send;
}

pub trait ExtractorFactory: Send + Sync + Clone {
    type Extractor: Extractor;
    fn create(&self, model: &str, base_url: &str) -> Result<Self::Extractor, AppError>;
}

pub trait ExtractionStore: Send + Sync + Clone {
    fn save(&self, extraction: &NewExtraction) -> impl Future<Output = Result<Uuid, AppError>> + Send;
    fn get_latest(&self, url: &str, schema_name: &str) -> impl Future<Output = Result<Option<Extraction>, AppError>> + Send;
    fn get_history(&self, url: &str, schema_name: &str, limit: usize, offset: usize) -> impl Future<Output = Result<Vec<Extraction>, AppError>> + Send;
}
```
JobQueue trait: see ares-core::job_queue — persistent queue with atomic claiming (SELECT FOR UPDATE SKIP LOCKED).
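Because each stage is a plain trait, swapping a component is a matter of writing one impl. The sketch below implements a custom Cleaner against a locally stubbed trait and error type (the real `Cleaner` and `AppError` live in ares-core, and this naive tag stripper is not how HtmdCleaner works):

```rust
// Local stand-in for ares_core::error::AppError (illustrative only).
#[derive(Debug)]
struct AppError(String);

// Mirrors the synchronous Cleaner trait shape shown above.
trait Cleaner: Send + Sync + Clone {
    fn clean(&self, html: &str) -> Result<String, AppError>;
}

#[derive(Clone)]
struct TagStripper;

impl Cleaner for TagStripper {
    // Naive cleaning: drop everything between '<' and '>'.
    fn clean(&self, html: &str) -> Result<String, AppError> {
        let mut out = String::with_capacity(html.len());
        let mut in_tag = false;
        for c in html.chars() {
            match c {
                '<' => in_tag = true,
                '>' => in_tag = false,
                _ if !in_tag => out.push(c),
                _ => {}
            }
        }
        Ok(out.trim().to_string())
    }
}
```

Against the real crate, the same impl (targeting `ares_core::traits::Cleaner`) can be passed to `ScrapeService` in place of HtmdCleaner, and a mock of it can drive unit tests.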
Key Types
| Type | Module | Purpose |
|---|---|---|
| Extraction | ares_core::models | Completed extraction (id, url, schema_name, extracted_data, hashes, model, created_at) |
| NewExtraction | ares_core::models | Insert DTO (no id/timestamps) |
| ScrapeResult | ares_core::models | Pipeline output (extracted_data, hashes, changed flag, extraction_id) |
| ScrapeJob | ares_core::job | Queued job with status, retry info, LLM config |
| JobStatus | ares_core::job | Enum: Pending, Running, Completed, Failed, Cancelled |
| RetryConfig | ares_core::job | Exponential backoff: 1min → 5min → 30min → 60min (capped) |
| WorkerConfig | ares_core::job | Worker settings: poll_interval, retry_config, skip_unchanged |
| AppError | ares_core::error | Error enum with is_retryable() and should_trip_circuit() |
| SchemaResolver | ares_core::schema | CRUD for schemas: resolve, create, update, delete + registry management |
| CircuitBreaker | ares_core::circuit_breaker | Closed → Open → HalfOpen state machine |
| ThrottledFetcher<F> | ares_core::throttle | Per-domain delay with jitter |
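The RetryConfig backoff schedule in the table (1min → 5min → 30min → 60min, capped) can be modeled as a lookup with a cap. Names below are illustrative, not RetryConfig's actual API:

```rust
use std::time::Duration;

// Sketch of the documented retry schedule; the real RetryConfig
// lives in ares_core::job and its fields/methods may differ.
struct RetrySchedule {
    delays: Vec<Duration>,
}

impl RetrySchedule {
    fn documented() -> Self {
        Self {
            delays: [1u64, 5, 30, 60]
                .into_iter()
                .map(|mins| Duration::from_secs(mins * 60))
                .collect(),
        }
    }

    // Attempt 0 waits 1 min; attempts past the table stay capped at 60 min.
    fn delay_for(&self, attempt: usize) -> Duration {
        self.delays[attempt.min(self.delays.len() - 1)]
    }
}
```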
Quick Start (Library Usage)
```rust
use ares_client::{ReqwestFetcher, HtmdCleaner, OpenAiExtractor};
use ares_core::{ScrapeService, NullStore};

let fetcher = ReqwestFetcher::new()?;
let cleaner = HtmdCleaner::new();
let extractor = OpenAiExtractor::with_base_url(&api_key, "gpt-4o-mini", "https://api.openai.com/v1")?;

let service = ScrapeService::<_, _, _, NullStore>::new(fetcher, cleaner, extractor, "gpt-4o-mini".into());

let schema = serde_json::json!({
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"}
    },
    "required": ["title", "author"]
});

let result = service.scrape("https://example.com/blog", &schema, "blog").await?;
println!("{}", serde_json::to_string_pretty(&result.extracted_data)?);
```
With persistence, use `ScrapeService::with_store(fetcher, cleaner, extractor, store, model)`.
Reference Guides
| Topic | File | When to Read |
|---|---|---|
| Architecture deep-dive | references/architecture.md | Understanding pipeline internals, crate dependencies, resilience patterns |
| JSON Schema system | references/schemas.md | Creating/managing schemas, registry, versioning |
| Extending Ares | references/extending.md | Implementing custom Fetcher/Cleaner/Extractor/Store/JobQueue |
| CLI & REST API | references/cli-and-server.md | Running CLI commands, calling API endpoints, deploying |
| Contributing | references/contributing.md | Dev setup, testing, CI, code style |
Version Notes
- Current version: 0.1.0
- crates.io release: Scheduled for February 29, 2026
- Until then, use a git dependency: `ares-core = { git = "https://github.com/AndreaBozzo/Ares" }`
- Works with any OpenAI-compatible API (OpenAI, Gemini, etc.)
- Browser support requires the `browser` feature flag: `--features browser`
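Putting the two notes above together, a Cargo manifest using the git dependency might look like this (whether the `browser` feature lives on ares-client is an assumption; check each crate's own manifest):

```toml
[dependencies]
ares-core = { git = "https://github.com/AndreaBozzo/Ares" }
# Assumed: browser-based fetching is gated behind the `browser` feature.
ares-client = { git = "https://github.com/AndreaBozzo/Ares", features = ["browser"] }
```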
Source
git clone https://github.com/AndreaBozzo/Ares-Claude-Skill
Overview
Ares offers a modular pipeline (fetcher, cleaner, extractor) and supports a REST API, schema management, deployment, and contributor workflows for a flexible, extensible scraping solution.
When to Use It
- Build a schema-driven scraper that outputs structured JSON according to a defined schema.
- Expose scraped data through a REST API using the ares-api crate and Axum routes.
- Extend the pipeline with custom fetchers, cleaners, or extractors for domain-specific needs.
- Manage and deploy scraping jobs with the CLI and persistent storage (history, jobs, workers).
- Collaborate on the Ares codebase by adding new schemas, components, or contributors guidance.
Best Practices
- Define and validate clear JSON Schemas for target sites before implementing extraction.
- Use dependency injection to swap Fetcher, Cleaner, and Extractor during testing.
- Use ExtractionStore to persist results and review extraction history, avoiding redundant re-extraction of unchanged pages.
- Test end-to-end pipelines with mock services and isolated components prior to deployment.
- Enable throttling and circuit breaking (ThrottledFetcher, CircuitBreaker) to handle load and reliability.
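The Closed → Open → HalfOpen machine behind CircuitBreaker can be sketched as follows. Thresholds, names, and the explicit cooldown transition are illustrative; the real type lives in ares_core::circuit_breaker:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum State { Closed, Open, HalfOpen }

// Minimal sketch of a failure-counting circuit breaker.
struct Breaker {
    state: State,
    failures: u32,
    threshold: u32,
}

impl Breaker {
    fn new(threshold: u32) -> Self {
        Self { state: State::Closed, failures: 0, threshold }
    }

    fn on_failure(&mut self) {
        self.failures += 1;
        if self.failures >= self.threshold {
            self.state = State::Open; // stop sending requests downstream
        }
    }

    fn on_success(&mut self) {
        self.failures = 0;
        self.state = State::Closed;
    }

    // A real breaker moves Open → HalfOpen after a timed cooldown and
    // allows one probe request; modeled here as an explicit transition.
    fn on_cooldown_elapsed(&mut self) {
        if self.state == State::Open {
            self.state = State::HalfOpen;
        }
    }
}
```

A fetcher wrapped in such a breaker fails fast while Open, then lets a single probe through in HalfOpen to decide whether to close again.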
Example Use Cases
- Scrape product data (title, price, availability) from an e-commerce site into a normalized JSON schema.
- Aggregate job postings (title, company, location, salary) from career portals using ares-core traits.
- Collect restaurant reviews and ratings and normalize into a structured schema for analytics.
- Persist extractions in PostgreSQL via ares-db and serve results through the ares-api REST API.
- Use the ares-cli to run scraping jobs, inspect history, and monitor worker status.