
Data Scraper

@mupengi-bot

npx machina-cli add skill @mupengi-bot/data-scraper --openclaw

data-scraper

Web Data Scraper — Extract structured data from web pages using curl + parsing. Lightweight, no browser required. Supports HTML-to-text, table extraction, price monitoring, and batch scraping.

When to Use

  • Extract text content from web pages (articles, blogs, docs)
  • Scrape product prices, reviews, or listings
  • Monitor pages for changes (price drops, new content)
  • Batch-collect data from multiple URLs
  • Convert HTML tables to structured formats (JSON/CSV)

Quick Start

# Extract readable text from URL
data-scraper fetch "https://example.com/article"

# Extract specific elements
data-scraper extract "https://example.com" --selector "h2, .price"

# Monitor for changes
data-scraper watch "https://example.com/product" --interval 3600

Extraction Modes

Text Mode (default)

Fetches the page and extracts readable content, stripping HTML tags, scripts, and styles, similar to a browser's reader mode.

data-scraper fetch URL
# Output: clean markdown text
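
A common follow-up is saving the result straight to a file; this sketch assumes the --format flag (documented under Output Formats below) applies to fetch:

# Save the readable text as Markdown (assumes fetch honors --format)
data-scraper fetch "https://example.com/article" --format md > article.md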

Selector Mode

Target specific CSS selectors for precise extraction.

data-scraper extract URL --selector ".product-title, .price, .rating"
# Output: matched elements as structured data
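
Assuming extract honors --format json from the Output Formats table, the matched elements pipe cleanly into jq; the array-of-strings output shape is also an assumption:

# Print each matched element on its own line
data-scraper extract "https://example.com" --selector ".price" --format json | jq -r '.[]'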

Table Mode

Extract HTML tables into structured formats.

data-scraper table URL --index 0
# Output: JSON array of row objects (header → value mapping)
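
A minimal sketch, assuming --format csv applies to table mode; the URL and filename are placeholders:

# Export the first table on the page as CSV for a spreadsheet
data-scraper table "https://example.com/catalog" --index 0 --format csv > catalog.csv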

Link Mode

Extract all links from a page with optional filtering.

data-scraper links URL --filter "*.pdf"
# Output: filtered list of absolute URLs
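
Because the output is a list of absolute URLs, link mode chains naturally into the batch mode described next; this sketch assumes one URL per line:

# Collect every PDF link into a file for a later batch run
data-scraper links "https://example.com/docs" --filter "*.pdf" > urls.txt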

Batch Scraping

# Scrape multiple URLs
data-scraper batch urls.txt --output results/

# With rate limiting
data-scraper batch urls.txt --delay 2000 --output results/

urls.txt format:

https://site1.com/page
https://site2.com/page
https://site3.com/page
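
A quick sanity check after a batch run; that results/ holds one output file per URL is my assumption, not documented behavior:

# Run the batch with a delay, then count the captured pages
data-scraper batch urls.txt --delay 2000 --format json --output results/
ls results/ | wc -l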

Change Monitoring

# Watch for changes, alert on diff
data-scraper watch URL --selector ".price" --interval 3600

# Compare with previous snapshot
data-scraper diff URL

Snapshots are stored in data-scraper/snapshots/ with timestamps, and an alert goes out via notification-hub when a change is detected.
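
Where a long-running watch process isn't practical, diff can be driven from cron instead; a hypothetical crontab entry (the log path is arbitrary):

# Compare against the previous snapshot every hour, logging any output
0 * * * * data-scraper diff "https://example.com/product" >> "$HOME/price-changes.log"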

Output Formats

Format    Flag           Use Case
Text      --format text  Reading, summarization
JSON      --format json  Data processing
CSV       --format csv   Spreadsheets
Markdown  --format md    Documentation

Headers & Auth

# Custom headers
data-scraper fetch URL --header "Authorization: Bearer TOKEN"

# Cookie-based auth
data-scraper fetch URL --cookie "session=abc123"

# User-Agent override
data-scraper fetch URL --ua "Mozilla/5.0..."
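
One practical habit: keep tokens out of shell history by reading them from the environment; this is plain shell, nothing tool-specific:

# Assumes API_TOKEN was exported beforehand, e.g. in ~/.profile
data-scraper fetch "https://api.example.com/report" --header "Authorization: Bearer $API_TOKEN"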

Rate Limiting & Ethics

  • Default: 1 request per second per domain
  • Respects robots.txt when --polite flag is set
  • Configurable delay between requests (see the sketch after this list)
  • Stops on 429 (Too Many Requests) and backs off
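
A polite batch run combining the documented flags above:

# Honor robots.txt and space requests two seconds apart
data-scraper batch urls.txt --polite --delay 2000 --output results/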

Error Handling

Error      Behavior
404        Log and skip
403/401    Warn about auth requirement
429        Exponential backoff (max 3 retries)
Timeout    Retry once with a longer timeout
SSL error  Warn; option to proceed with --insecure
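
For the SSL case, the documented escape hatch looks like this; the hostname is a placeholder, and --insecure should only be used for hosts you already trust:

# Proceed past certificate errors, per the table above
data-scraper fetch "https://self-signed.internal/status" --insecure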

Integration

  • web-claude: Use as fallback when web_fetch isn't enough
  • competitor-watch: Feed scraped data into competitor analysis
  • seo-audit: Scrape competitor pages for SEO comparison
  • performance-tracker: Collect social metrics from public profiles

Source

git clone https://clawhub.ai/mupengi-bot/data-scraper

Overview

Data Scraper fetches and extracts structured data from web pages without a browser. It supports HTML-to-text conversion, CSS-selector extraction, table export, change monitoring, and batch scraping across multiple URLs, making data collection faster and more repeatable.

How This Skill Works

Data Scraper fetches a page and then applies one or more extraction modes. Text Mode returns readable content by stripping HTML; Selector Mode targets CSS selectors for precise data; Table Mode exports HTML tables as JSON/CSV. Outputs support text, JSON, CSV, and Markdown for downstream use.
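
As a compact illustration of that flow (URLs are placeholders, and --format on table mode is an assumption):

# The three extraction modes against the same page
data-scraper fetch "https://example.com/catalog"                         # readable text
data-scraper extract "https://example.com/catalog" --selector ".price"   # targeted fields
data-scraper table "https://example.com/catalog" --index 0 --format csv  # table export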

When to Use It

  • Extract text content from web pages such as articles, blogs, and docs
  • Scrape product prices, reviews, or listings
  • Monitor pages for changes like price drops or new content
  • Batch-collect data from multiple URLs
  • Convert HTML tables to structured formats (JSON/CSV)

Quick Start

  1. Fetch readable text: data-scraper fetch "https://example.com/article"
  2. Extract targeted elements: data-scraper extract "https://example.com" --selector "h2, .price"
  3. Watch for changes: data-scraper watch "https://example.com/product" --interval 3600

Best Practices

  • Start with Text Mode to grab article content before refining with selectors
  • Use Selector Mode for precise fields (e.g., .title, .price, .rating)
  • Leverage Table Mode for feeding data into JSON/CSV workflows
  • Apply batch scraping with --delay or watch intervals to respect rate limits
  • Choose the appropriate output format (--format) to fit your data pipeline

Example Use Cases

  • Extract readable article text from a blog post URL
  • Capture product prices and ratings from an e-commerce product page
  • Monitor a listing page for price changes and new items
  • Batch-scrape URLs from a list into JSON for analysis (see the sketch below)
  • Convert catalog tables into JSON or CSV for inventory tracking
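
A sketch of the batch-to-JSON use case above; the merge step assumes results/ holds one JSON array per URL, which is not documented:

# Scrape the list, then merge every per-URL result into a single file
data-scraper batch urls.txt --delay 2000 --format json --output results/
jq -s 'add' results/*.json > combined.json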
