Ares — LLM-Powered Web Scraper
Ares is a Rust library, CLI, and HTTP server that extracts structured data from websites using LLMs and JSON Schemas.
Repository: https://github.com/AndreaBozzo/Ares
License: Apache-2.0 | Rust edition: 2024 | MSRV: 1.88+
Pipeline
URL → Fetcher (HTML) → Cleaner (Markdown) → Extractor (LLM + JSON Schema) → Hash → Compare → Store
Each stage is a trait, so every component can be swapped or mocked independently.
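The Hash → Compare step can be sketched with the standard library's hasher. Ares's actual hash algorithm and field names are not documented here, so this is an illustration of the change-detection idea, not the library's implementation:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash the extracted content so a later scrape can be compared cheaply.
// DefaultHasher is illustrative; Ares may use a different algorithm.
fn content_hash(data: &str) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Compare against the previously stored hash to set the `changed` flag.
fn has_changed(new_data: &str, previous_hash: Option<u64>) -> bool {
    match previous_hash {
        Some(prev) => content_hash(new_data) != prev,
        None => true, // first scrape of a URL always counts as changed
    }
}
```

A worker can use this flag to skip storing (or re-processing) unchanged pages.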
Crate Map
| Crate | Purpose | Key Exports |
|---|---|---|
| ares-core | Business logic, traits, pipeline | ScrapeService, WorkerService, CircuitBreaker, ThrottledFetcher, traits |
| ares-client | HTTP/browser fetchers, cleaner, LLM client | ReqwestFetcher, BrowserFetcher, HtmdCleaner, OpenAiExtractor |
| ares-db | PostgreSQL persistence | Database, ExtractionRepository, ScrapeJobRepository |
| ares-api | Axum REST API | Routes, DTOs, bearer auth, OpenAPI/Swagger |
| ares-cli | Command-line interface | scrape, history, job, worker subcommands |
Core Traits (ares-core::traits)
```rust
pub trait Fetcher: Send + Sync + Clone {
    fn fetch(&self, url: &str) -> impl Future<Output = Result<String, AppError>> + Send;
}

pub trait Cleaner: Send + Sync + Clone {
    fn clean(&self, html: &str) -> Result<String, AppError>;
}

pub trait Extractor: Send + Sync + Clone {
    fn extract(&self, content: &str, schema: &serde_json::Value)
        -> impl Future<Output = Result<serde_json::Value, AppError>> + Send;
}

pub trait ExtractorFactory: Send + Sync + Clone {
    type Extractor: Extractor;
    fn create(&self, model: &str, base_url: &str) -> Result<Self::Extractor, AppError>;
}

pub trait ExtractionStore: Send + Sync + Clone {
    fn save(&self, extraction: &NewExtraction) -> impl Future<Output = Result<Uuid, AppError>> + Send;
    fn get_latest(&self, url: &str, schema_name: &str) -> impl Future<Output = Result<Option<Extraction>, AppError>> + Send;
    fn get_history(&self, url: &str, schema_name: &str, limit: usize, offset: usize) -> impl Future<Output = Result<Vec<Extraction>, AppError>> + Send;
}
```
JobQueue trait: see ares-core::job_queue — persistent queue with atomic claiming (SELECT FOR UPDATE SKIP LOCKED).
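Because each stage is a plain trait, swapping a component is a matter of writing one impl. The sketch below implements a custom Cleaner against a locally stubbed trait and error type (the real `Cleaner` and `AppError` live in ares-core, and this naive tag stripper is not how HtmdCleaner works):

```rust
// Local stand-in for ares_core::error::AppError (illustrative only).
#[derive(Debug)]
struct AppError(String);

// Mirrors the synchronous Cleaner trait shape shown above.
trait Cleaner: Send + Sync + Clone {
    fn clean(&self, html: &str) -> Result<String, AppError>;
}

#[derive(Clone)]
struct TagStripper;

impl Cleaner for TagStripper {
    // Naive cleaning: drop everything between '<' and '>'.
    fn clean(&self, html: &str) -> Result<String, AppError> {
        let mut out = String::with_capacity(html.len());
        let mut in_tag = false;
        for c in html.chars() {
            match c {
                '<' => in_tag = true,
                '>' => in_tag = false,
                _ if !in_tag => out.push(c),
                _ => {}
            }
        }
        Ok(out.trim().to_string())
    }
}
```

Against the real crate, the same impl (targeting `ares_core::traits::Cleaner`) can be passed to `ScrapeService` in place of HtmdCleaner, and a mock of it can drive unit tests.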
Key Types
| Type | Module | Purpose |
|---|---|---|
| Extraction | ares_core::models | Completed extraction (id, url, schema_name, extracted_data, hashes, model, created_at) |
| NewExtraction | ares_core::models | Insert DTO (no id/timestamps) |
| ScrapeResult | ares_core::models | Pipeline output (extracted_data, hashes, changed flag, extraction_id) |
| ScrapeJob | ares_core::job | Queued job with status, retry info, LLM config |
| JobStatus | ares_core::job | Enum: Pending, Running, Completed, Failed, Cancelled |
| RetryConfig | ares_core::job | Exponential backoff: 1min → 5min → 30min → 60min (capped) |
| WorkerConfig | ares_core::job | Worker settings: poll_interval, retry_config, skip_unchanged |
| AppError | ares_core::error | Error enum with is_retryable() and should_trip_circuit() |
| SchemaResolver | ares_core::schema | CRUD for schemas: resolve, create, update, delete + registry management |
| CircuitBreaker | ares_core::circuit_breaker | Closed → Open → HalfOpen state machine |
| ThrottledFetcher<F> | ares_core::throttle | Per-domain delay with jitter |
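The RetryConfig backoff schedule in the table (1min → 5min → 30min → 60min, capped) can be modeled as a lookup with a cap. Names below are illustrative, not RetryConfig's actual API:

```rust
use std::time::Duration;

// Sketch of the documented retry schedule; the real RetryConfig
// lives in ares_core::job and its fields/methods may differ.
struct RetrySchedule {
    delays: Vec<Duration>,
}

impl RetrySchedule {
    fn documented() -> Self {
        Self {
            delays: [1u64, 5, 30, 60]
                .into_iter()
                .map(|mins| Duration::from_secs(mins * 60))
                .collect(),
        }
    }

    // Attempt 0 waits 1 min; attempts past the table stay capped at 60 min.
    fn delay_for(&self, attempt: usize) -> Duration {
        self.delays[attempt.min(self.delays.len() - 1)]
    }
}
```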
Quick Start (Library Usage)
```rust
use ares_client::{ReqwestFetcher, HtmdCleaner, OpenAiExtractor};
use ares_core::{ScrapeService, NullStore};

let fetcher = ReqwestFetcher::new()?;
let cleaner = HtmdCleaner::new();
let extractor = OpenAiExtractor::with_base_url(&api_key, "gpt-4o-mini", "https://api.openai.com/v1")?;

let service = ScrapeService::<_, _, _, NullStore>::new(fetcher, cleaner, extractor, "gpt-4o-mini".into());

let schema = serde_json::json!({
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"}
    },
    "required": ["title", "author"]
});

let result = service.scrape("https://example.com/blog", &schema, "blog").await?;
println!("{}", serde_json::to_string_pretty(&result.extracted_data)?);
```
With persistence, use `ScrapeService::with_store(fetcher, cleaner, extractor, store, model)`.
Reference Guides
| Topic | File | When to Read |
|---|---|---|
| Architecture deep-dive | references/architecture.md | Understanding pipeline internals, crate dependencies, resilience patterns |
| JSON Schema system | references/schemas.md | Creating/managing schemas, registry, versioning |
| Extending Ares | references/extending.md | Implementing custom Fetcher/Cleaner/Extractor/Store/JobQueue |
| CLI & REST API | references/cli-and-server.md | Running CLI commands, calling API endpoints, deploying |
| Contributing | references/contributing.md | Dev setup, testing, CI, code style |
Version Notes
- Current version: 0.1.0
- crates.io release: Scheduled for February 29, 2026
- Until then, use a git dependency: `ares-core = { git = "https://github.com/AndreaBozzo/Ares" }`
- Works with any OpenAI-compatible API (OpenAI, Gemini, etc.)
- Browser support requires the `browser` feature flag: `--features browser`
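Putting the two notes above together, a Cargo manifest using the git dependency might look like this (whether the `browser` feature lives on ares-client is an assumption; check each crate's own manifest):

```toml
[dependencies]
ares-core = { git = "https://github.com/AndreaBozzo/Ares" }
# Assumed: browser-based fetching is gated behind the `browser` feature.
ares-client = { git = "https://github.com/AndreaBozzo/Ares", features = ["browser"] }
```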
Source
git clone https://github.com/AndreaBozzo/Ares-Claude-Skill
Overview
Ares offers a modular pipeline (fetcher, cleaner, extractor) and supports a REST API, schema management, deployment, and contributor workflows for a flexible, extensible scraping solution.
When to Use It
- Build a schema-driven scraper that outputs structured JSON according to a defined schema.
- Expose scraped data through a REST API using the ares-api crate and Axum routes.
- Extend the pipeline with custom fetchers, cleaners, or extractors for domain-specific needs.
- Manage and deploy scraping jobs with the CLI and persistent storage (history, jobs, workers).
- Collaborate on the Ares codebase by adding new schemas, components, or contributors guidance.
Best Practices
- Define and validate clear JSON Schemas for target sites before implementing extraction.
- Use dependency injection to swap Fetcher, Cleaner, and Extractor during testing.
- Use ExtractionStore to persist results and review extraction history, avoiding redundant re-extraction of unchanged pages.
- Test end-to-end pipelines with mock services and isolated components prior to deployment.
- Enable throttling and circuit breaking (ThrottledFetcher, CircuitBreaker) to handle load and reliability.
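The Closed → Open → HalfOpen machine behind CircuitBreaker can be sketched as follows. Thresholds, names, and the explicit cooldown transition are illustrative; the real type lives in ares_core::circuit_breaker:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum State { Closed, Open, HalfOpen }

// Minimal sketch of a failure-counting circuit breaker.
struct Breaker {
    state: State,
    failures: u32,
    threshold: u32,
}

impl Breaker {
    fn new(threshold: u32) -> Self {
        Self { state: State::Closed, failures: 0, threshold }
    }

    fn on_failure(&mut self) {
        self.failures += 1;
        if self.failures >= self.threshold {
            self.state = State::Open; // stop sending requests downstream
        }
    }

    fn on_success(&mut self) {
        self.failures = 0;
        self.state = State::Closed;
    }

    // A real breaker moves Open → HalfOpen after a timed cooldown and
    // allows one probe request; modeled here as an explicit transition.
    fn on_cooldown_elapsed(&mut self) {
        if self.state == State::Open {
            self.state = State::HalfOpen;
        }
    }
}
```

A fetcher wrapped in such a breaker fails fast while Open, then lets a single probe through in HalfOpen to decide whether to close again.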
Example Use Cases
- Scrape product data (title, price, availability) from an e-commerce site into a normalized JSON schema.
- Aggregate job postings (title, company, location, salary) from career portals using ares-core traits.
- Collect restaurant reviews and ratings and normalize into a structured schema for analytics.
- Persist extractions in PostgreSQL via ares-db and serve results through the ares-api REST API.
- Use the ares-cli to run scraping jobs, inspect history, and monitor worker status.