
Data Scraper

@mupengi-bot

npx machina-cli add skill @mupengi-bot/data-scraper --openclaw

data-scraper

Web Data Scraper — Extract structured data from web pages using curl + parsing. Lightweight, no browser required. Supports HTML-to-text, table extraction, price monitoring, and batch scraping.

When to Use

  • Extract text content from web pages (articles, blogs, docs)
  • Scrape product prices, reviews, or listings
  • Monitor pages for changes (price drops, new content)
  • Batch-collect data from multiple URLs
  • Convert HTML tables to structured formats (JSON/CSV)

Quick Start

# Extract readable text from URL
data-scraper fetch "https://example.com/article"

# Extract specific elements
data-scraper extract "https://example.com" --selector "h2, .price"

# Monitor for changes
data-scraper watch "https://example.com/product" --interval 3600

Extraction Modes

Text Mode (default)

Fetches the page and extracts readable content, stripping HTML tags, scripts, and styles, similar to a browser's reader mode.

data-scraper fetch URL
# Output: clean markdown text
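
A common follow-up is saving the result straight to a file; this sketch assumes the --format flag (documented under Output Formats below) applies to fetch:

# Save the readable text as Markdown (assumes fetch honors --format)
data-scraper fetch "https://example.com/article" --format md > article.md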

Selector Mode

Target specific CSS selectors for precise extraction.

data-scraper extract URL --selector ".product-title, .price, .rating"
# Output: matched elements as structured data
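
Assuming extract honors --format json from the Output Formats table, the matched elements pipe cleanly into jq; the array-of-strings output shape is also an assumption:

# Print each matched element on its own line
data-scraper extract "https://example.com" --selector ".price" --format json | jq -r '.[]'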

Table Mode

Extract HTML tables into structured formats.

data-scraper table URL --index 0
# Output: JSON array of row objects (header → value mapping)
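
A minimal sketch, assuming --format csv applies to table mode; the URL and filename are placeholders:

# Export the first table on the page as CSV for a spreadsheet
data-scraper table "https://example.com/catalog" --index 0 --format csv > catalog.csv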

Link Mode

Extract all links from a page with optional filtering.

data-scraper links URL --filter "*.pdf"
# Output: filtered list of absolute URLs
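
Because the output is a list of absolute URLs, link mode chains naturally into the batch mode described next; this sketch assumes one URL per line:

# Collect every PDF link into a file for a later batch run
data-scraper links "https://example.com/docs" --filter "*.pdf" > urls.txt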

Batch Scraping

# Scrape multiple URLs
data-scraper batch urls.txt --output results/

# With rate limiting
data-scraper batch urls.txt --delay 2000 --output results/

urls.txt format:

https://site1.com/page
https://site2.com/page
https://site3.com/page
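
A quick sanity check after a batch run; that results/ holds one output file per URL is my assumption, not documented behavior:

# Run the batch with a delay, then count the captured pages
data-scraper batch urls.txt --delay 2000 --format json --output results/
ls results/ | wc -l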

Change Monitoring

# Watch for changes, alert on diff
data-scraper watch URL --selector ".price" --interval 3600

# Compare with previous snapshot
data-scraper diff URL

Snapshots are stored in data-scraper/snapshots/ with timestamps, and an alert goes out via notification-hub when a change is detected.
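
Where a long-running watch process isn't practical, diff can be driven from cron instead; a hypothetical crontab entry (the log path is arbitrary):

# Compare against the previous snapshot every hour, logging any output
0 * * * * data-scraper diff "https://example.com/product" >> "$HOME/price-changes.log"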

Output Formats

Format    Flag           Use Case
Text      --format text  Reading, summarization
JSON      --format json  Data processing
CSV       --format csv   Spreadsheets
Markdown  --format md    Documentation

Headers & Auth

# Custom headers
data-scraper fetch URL --header "Authorization: Bearer TOKEN"

# Cookie-based auth
data-scraper fetch URL --cookie "session=abc123"

# User-Agent override
data-scraper fetch URL --ua "Mozilla/5.0..."
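
One practical habit: keep tokens out of shell history by reading them from the environment; this is plain shell, nothing tool-specific:

# Assumes API_TOKEN was exported beforehand, e.g. in ~/.profile
data-scraper fetch "https://api.example.com/report" --header "Authorization: Bearer $API_TOKEN"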

Rate Limiting & Ethics

  • Default: 1 request per second per domain
  • Respects robots.txt when --polite flag is set
  • Configurable delay between requests (see the sketch after this list)
  • Stops on 429 (Too Many Requests) and backs off
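
A polite batch run combining the documented flags above:

# Honor robots.txt and space requests two seconds apart
data-scraper batch urls.txt --polite --delay 2000 --output results/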

Error Handling

Error      Behavior
404        Log and skip
403/401    Warn about auth requirement
429        Exponential backoff (max 3 retries)
Timeout    Retry once with a longer timeout
SSL error  Warn; option to proceed with --insecure
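
For the SSL case, the documented escape hatch looks like this; the hostname is a placeholder, and --insecure should only be used for hosts you already trust:

# Proceed past certificate errors, per the table above
data-scraper fetch "https://self-signed.internal/status" --insecure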

Integration

  • web-claude: Use as fallback when web_fetch isn't enough
  • competitor-watch: Feed scraped data into competitor analysis
  • seo-audit: Scrape competitor pages for SEO comparison
  • performance-tracker: Collect social metrics from public profiles

Source

git clone https://clawhub.ai/mupengi-bot/data-scraper

Overview

Data Scraper fetches and extracts structured data from web pages without a browser. It supports HTML-to-text conversion, CSS-selector extraction, table export, change monitoring, and batch scraping across multiple URLs, making data collection faster and more repeatable.

How This Skill Works

Data Scraper fetches a page and then applies one or more extraction modes. Text Mode returns readable content by stripping HTML; Selector Mode targets CSS selectors for precise data; Table Mode exports HTML tables as JSON/CSV. Outputs support text, JSON, CSV, and Markdown for downstream use.
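
As a compact illustration of that flow (URLs are placeholders, and --format on table mode is an assumption):

# The three extraction modes against the same page
data-scraper fetch "https://example.com/catalog"                         # readable text
data-scraper extract "https://example.com/catalog" --selector ".price"   # targeted fields
data-scraper table "https://example.com/catalog" --index 0 --format csv  # table export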

When to Use It

  • Extract text content from web pages such as articles, blogs, and docs
  • Scrape product prices, reviews, or listings
  • Monitor pages for changes like price drops or new content
  • Batch-collect data from multiple URLs
  • Convert HTML tables to structured formats (JSON/CSV)

Quick Start

  1. Fetch readable text: data-scraper fetch "https://example.com/article"
  2. Extract targeted elements: data-scraper extract "https://example.com" --selector "h2, .price"
  3. Watch for changes: data-scraper watch "https://example.com/product" --interval 3600

Best Practices

  • Start with Text Mode to grab article content before refining with selectors
  • Use Selector Mode for precise fields (e.g., .title, .price, .rating)
  • Leverage Table Mode for feeding data into JSON/CSV workflows
  • Apply batch scraping with --delay or watch intervals to respect rate limits
  • Choose the appropriate output format (--format) to fit your data pipeline

Example Use Cases

  • Extract readable article text from a blog post URL
  • Capture product prices and ratings from an e-commerce product page
  • Monitor a listing page for price changes and new items
  • Batch-scrape URLs from a list into JSON for analysis (see the sketch below)
  • Convert catalog tables into JSON or CSV for inventory tracking
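
A sketch of the batch-to-JSON use case above; the merge step assumes results/ holds one JSON array per URL, which is not documented:

# Scrape the list, then merge every per-URL result into a single file
data-scraper batch urls.txt --delay 2000 --format json --output results/
jq -s 'add' results/*.json > combined.json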
