
Ares

Install this skill:

```shell
npx machina-cli add skill AndreaBozzo/Ares-Claude-Skill/ares --openclaw
```

Ares — LLM-Powered Web Scraper

Ares is a Rust library, CLI, and HTTP server that extracts structured data from websites using LLMs and JSON Schemas.

Repository: https://github.com/AndreaBozzo/Ares | License: Apache-2.0 | Rust edition: 2024 | MSRV: 1.88+

Pipeline

URL → Fetcher (HTML) → Cleaner (Markdown) → Extractor (LLM + JSON Schema) → Hash → Compare → Store

Each stage is a trait, so every component can be swapped or mocked independently.

Crate Map

| Crate | Purpose | Key Exports |
| --- | --- | --- |
| `ares-core` | Business logic, traits, pipeline | ScrapeService, WorkerService, CircuitBreaker, ThrottledFetcher, traits |
| `ares-client` | HTTP/browser fetchers, cleaner, LLM client | ReqwestFetcher, BrowserFetcher, HtmdCleaner, OpenAiExtractor |
| `ares-db` | PostgreSQL persistence | Database, ExtractionRepository, ScrapeJobRepository |
| `ares-api` | Axum REST API | Routes, DTOs, bearer auth, OpenAPI/Swagger |
| `ares-cli` | Command-line interface | scrape, history, job, worker subcommands |

Core Traits (ares-core::traits)

```rust
pub trait Fetcher: Send + Sync + Clone {
    fn fetch(&self, url: &str) -> impl Future<Output = Result<String, AppError>> + Send;
}

pub trait Cleaner: Send + Sync + Clone {
    fn clean(&self, html: &str) -> Result<String, AppError>;
}

pub trait Extractor: Send + Sync + Clone {
    fn extract(&self, content: &str, schema: &serde_json::Value)
        -> impl Future<Output = Result<serde_json::Value, AppError>> + Send;
}

pub trait ExtractorFactory: Send + Sync + Clone {
    type Extractor: Extractor;
    fn create(&self, model: &str, base_url: &str) -> Result<Self::Extractor, AppError>;
}

pub trait ExtractionStore: Send + Sync + Clone {
    fn save(&self, extraction: &NewExtraction)
        -> impl Future<Output = Result<Uuid, AppError>> + Send;
    fn get_latest(&self, url: &str, schema_name: &str)
        -> impl Future<Output = Result<Option<Extraction>, AppError>> + Send;
    fn get_history(&self, url: &str, schema_name: &str, limit: usize, offset: usize)
        -> impl Future<Output = Result<Vec<Extraction>, AppError>> + Send;
}
```

JobQueue trait: see ares-core::job_queue — persistent queue with atomic claiming (SELECT FOR UPDATE SKIP LOCKED).

Key Types

| Type | Module | Purpose |
| --- | --- | --- |
| `Extraction` | `ares_core::models` | Completed extraction (id, url, schema_name, extracted_data, hashes, model, created_at) |
| `NewExtraction` | `ares_core::models` | Insert DTO (no id/timestamps) |
| `ScrapeResult` | `ares_core::models` | Pipeline output (extracted_data, hashes, changed flag, extraction_id) |
| `ScrapeJob` | `ares_core::job` | Queued job with status, retry info, LLM config |
| `JobStatus` | `ares_core::job` | Enum: Pending, Running, Completed, Failed, Cancelled |
| `RetryConfig` | `ares_core::job` | Exponential backoff: 1 min → 5 min → 30 min → 60 min (capped) |
| `WorkerConfig` | `ares_core::job` | Worker settings: poll_interval, retry_config, skip_unchanged |
| `AppError` | `ares_core::error` | Error enum with is_retryable() and should_trip_circuit() |
| `SchemaResolver` | `ares_core::schema` | CRUD for schemas: resolve, create, update, delete + registry management |
| `CircuitBreaker` | `ares_core::circuit_breaker` | Closed → Open → HalfOpen state machine |
| `ThrottledFetcher<F>` | `ares_core::throttle` | Per-domain delay with jitter |

Quick Start (Library Usage)

```rust
use ares_client::{ReqwestFetcher, HtmdCleaner, OpenAiExtractor};
use ares_core::{ScrapeService, NullStore};

let fetcher = ReqwestFetcher::new()?;
let cleaner = HtmdCleaner::new();
let extractor = OpenAiExtractor::with_base_url(&api_key, "gpt-4o-mini", "https://api.openai.com/v1")?;

let service = ScrapeService::<_, _, _, NullStore>::new(fetcher, cleaner, extractor, "gpt-4o-mini".into());

let schema = serde_json::json!({
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"}
    },
    "required": ["title", "author"]
});

let result = service.scrape("https://example.com/blog", &schema, "blog").await?;
println!("{}", serde_json::to_string_pretty(&result.extracted_data)?);
```

With persistence, use `ScrapeService::with_store(fetcher, cleaner, extractor, store, model)`.

Reference Guides

| Topic | File | When to Read |
| --- | --- | --- |
| Architecture deep-dive | references/architecture.md | Understanding pipeline internals, crate dependencies, resilience patterns |
| JSON Schema system | references/schemas.md | Creating/managing schemas, registry, versioning |
| Extending Ares | references/extending.md | Implementing custom Fetcher/Cleaner/Extractor/Store/JobQueue |
| CLI & REST API | references/cli-and-server.md | Running CLI commands, calling API endpoints, deploying |
| Contributing | references/contributing.md | Dev setup, testing, CI, code style |

Version Notes

  • Current version: 0.1.0
  • crates.io release: scheduled for February 2026
  • Until then, use a git dependency: `ares-core = { git = "https://github.com/AndreaBozzo/Ares" }`
  • Works with any OpenAI-compatible API (OpenAI, Gemini, etc.)
  • Browser support requires a feature flag: `--features browser`
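Putting the git dependency and the browser feature together, a Cargo.toml could look like the sketch below. Which crate actually exposes the `browser` feature is an assumption here (the fetchers live in ares-client); check the repository's manifests before copying:

```toml
[dependencies]
# Git dependencies until the crates.io release lands.
ares-core = { git = "https://github.com/AndreaBozzo/Ares" }
# Assumed: the browser feature gates BrowserFetcher in ares-client.
ares-client = { git = "https://github.com/AndreaBozzo/Ares", features = ["browser"] }
```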

Source

SKILL.md: https://github.com/AndreaBozzo/Ares-Claude-Skill/blob/main/ares/SKILL.md

Overview

Ares offers a modular pipeline (fetcher, cleaner, extractor) for extracting structured data from websites with LLMs and JSON Schemas, and supports a REST API, schema management, deployment, and contributor workflows for a flexible, extensible scraping solution.

How This Skill Works

The scraping pipeline runs URL → Fetcher (HTML) → Cleaner (Markdown) → Extractor (LLM + JSON Schema) → Hash → Compare → Store. Each stage is a trait (Fetcher, Cleaner, Extractor) so components can be swapped or mocked, with supporting crates for core logic, HTTP clients, storage, API, and CLI.

When to Use It

  • Build a schema-driven scraper that outputs structured JSON according to a defined schema.
  • Expose scraped data through a REST API using the ares-api crate and Axum routes.
  • Extend the pipeline with custom fetchers, cleaners, or extractors for domain-specific needs.
  • Manage and deploy scraping jobs with the CLI and persistent storage (history, jobs, workers).
  • Collaborate on the Ares codebase by adding new schemas, components, or contributor guidance.

Quick Start

  1. Add the ares crates as dependencies and import the components (ReqwestFetcher, HtmdCleaner, OpenAiExtractor) from ares-client.
  2. Instantiate the components: `let fetcher = ReqwestFetcher::new()?;`, `let cleaner = HtmdCleaner::new();`, `let extractor = OpenAiExtractor::with_base_url(&api_key, "gpt-4o-mini", "https://api.openai.com/v1")?;`
  3. Wire the pipeline and run a scrape: `let service = ScrapeService::<_, _, _, NullStore>::new(fetcher, cleaner, extractor, "gpt-4o-mini".into());` then `let result = service.scrape("https://example.com/blog", &schema, "blog").await?;`

Best Practices

  • Define and validate clear JSON Schemas for target sites before implementing extraction.
  • Use dependency injection to swap Fetcher, Cleaner, and Extractor during testing.
  • Use ExtractionStore to persist results and query extraction history, enabling change detection between runs.
  • Test end-to-end pipelines with mock services and isolated components prior to deployment.
  • Enable throttling and circuit breaking (ThrottledFetcher, CircuitBreaker) to handle load and reliability.

Example Use Cases

  • Scrape product data (title, price, availability) from an e-commerce site into a normalized JSON schema.
  • Aggregate job postings (title, company, location, salary) from career portals using ares-core traits.
  • Collect restaurant reviews and ratings and normalize into a structured schema for analytics.
  • Persist extractions in PostgreSQL via ares-db and serve results through the ares-api REST API.
  • Use the ares-cli to run scraping jobs, inspect history, and monitor worker status.
