Web Scraping Skill v3.0

Install:

npx machina-cli add skill aiskillstore/marketplace/web-scrape --openclaw

Usage

/web-scrape <url> [options]

Options:

  • --format=markdown|json|text - Output format (default: markdown)
  • --full - Include full page content (skip smart extraction)
  • --screenshot - Also save a screenshot
  • --scroll - Scroll to load dynamic content (infinite scroll pages)

Examples:

/web-scrape https://example.com/article
/web-scrape https://news.site.com/story --format=json
/web-scrape https://spa-app.com/page --scroll --screenshot
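
The flags above could be parsed with a small helper. This is an illustrative sketch only — the `parseOptions` name and its defaults are assumptions, not part of the skill:

```javascript
// Hypothetical sketch: parse /web-scrape arguments into an options object.
function parseOptions(args) {
  const opts = { url: null, format: "markdown", full: false, screenshot: false, scroll: false };
  for (const arg of args) {
    if (arg.startsWith("--format=")) {
      const value = arg.slice("--format=".length);
      if (!["markdown", "json", "text"].includes(value)) {
        throw new Error(`Unsupported format: ${value}`);
      }
      opts.format = value;
    } else if (arg === "--full") opts.full = true;
    else if (arg === "--screenshot") opts.screenshot = true;
    else if (arg === "--scroll") opts.scroll = true;
    else if (!opts.url) opts.url = arg; // first positional argument is the URL
  }
  if (!opts.url) throw new Error("A target URL is required");
  return opts;
}
```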

Execution Flow

Phase 1: Navigate and Load

1. mcp__playwright__browser_navigate
   url: "<target URL>"

2. mcp__playwright__browser_wait_for
   time: 2  (allow initial render)

If --scroll option: Execute scroll sequence to trigger lazy loading:

3. mcp__playwright__browser_evaluate
   function: "async () => {
     for (let i = 0; i < 3; i++) {
       window.scrollTo(0, document.body.scrollHeight);
       await new Promise(r => setTimeout(r, 1000));
     }
     window.scrollTo(0, 0);
   }"

Phase 2: Capture Content

4. mcp__playwright__browser_snapshot
   → Returns full accessibility tree with all text content

If --screenshot option:

5. mcp__playwright__browser_take_screenshot
   filename: "scraped_<domain>_<timestamp>.png"
   fullPage: true
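
The `scraped_<domain>_<timestamp>.png` pattern could be generated with a helper like this sketch (the `screenshotFilename` name and the exact timestamp sanitization are assumptions):

```javascript
// Hypothetical sketch: build a "scraped_<domain>_<timestamp>.png" filename from a URL.
function screenshotFilename(url, now = new Date()) {
  const domain = new URL(url).hostname.replace(/^www\./, "");
  // ISO timestamp with characters unsafe in filenames replaced by dashes.
  const timestamp = now.toISOString().replace(/[:.]/g, "-");
  return `scraped_${domain}_${timestamp}.png`;
}
```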

Phase 3: Close Browser

6. mcp__playwright__browser_close

Smart Content Extraction

After getting the snapshot, apply intelligent extraction:

Step 1: Identify Content Type

| Page Type | Indicators | Extraction Strategy |
|-----------|------------|---------------------|
| Article/Blog | `<article>`, long paragraphs, date/author | Extract main article body |
| Product Page | Price, "Add to Cart", specs | Extract title, price, description, specs |
| Documentation | Code blocks, heading hierarchy | Preserve structure and code |
| List/Search | Repeated item patterns | Extract as structured list |
| Landing Page | Hero section, CTAs | Extract key messaging |
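
A rough classifier over the snapshot text might look like the sketch below. The function name and the specific heuristics are assumptions; real pages need fuzzier signals than these string checks:

```javascript
// Hypothetical sketch: map snapshot text to a page type, loosely following the table above.
function classifyPage(text) {
  const t = text.toLowerCase();
  if (t.includes("add to cart") || /price[:\s]/.test(t)) return "product";
  if ((text.match(/```/g) || []).length >= 2) return "docs";        // multiple code blocks
  if (/<article>/.test(text) || /\bauthor\b/.test(t)) return "article";
  if (/sign up|get started|free trial/.test(t)) return "landing";   // CTA wording
  return "list";                                                    // fallback: repeated items
}
```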

Step 2: Filter Noise

ALWAYS REMOVE these elements from output:

  • Navigation menus and breadcrumbs
  • Footer content (copyright, links)
  • Sidebars (ads, related articles, social links)
  • Cookie banners and popups
  • Comments section (unless specifically requested)
  • Share buttons and social widgets
  • Login/signup prompts
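
Much of that noise maps onto ARIA landmark roles in the accessibility snapshot, so a first-pass filter could drop lines by role. This is a sketch under assumptions — the exact snapshot syntax may differ, and a full version would also drop the indented children of a removed node:

```javascript
// Hypothetical sketch: drop snapshot lines whose landmark role marks them as noise.
const NOISE_ROLES = ["navigation", "banner", "contentinfo", "complementary", "dialog"];

function filterNoise(snapshotLines) {
  // Note: this only removes the matching line itself, not nested children.
  return snapshotLines.filter(line =>
    !NOISE_ROLES.some(role => line.trimStart().startsWith(`- ${role}`))
  );
}
```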

Step 3: Structure the Content

For Articles:

# [Title]

**Source:** [URL]
**Date:** [if available]
**Author:** [if available]

---

[Main content in clean markdown]
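
The article template above could be rendered by a helper that omits metadata lines when the page doesn't provide them. The `renderArticle` name and field shape are assumptions:

```javascript
// Hypothetical sketch: render the article template, skipping missing metadata.
function renderArticle({ title, url, date, author, body }) {
  const lines = [`# ${title}`, "", `**Source:** ${url}`];
  if (date) lines.push(`**Date:** ${date}`);
  if (author) lines.push(`**Author:** ${author}`);
  lines.push("", "---", "", body);
  return lines.join("\n");
}
```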

For Product Pages:

# [Product Name]

**Price:** [price]
**Availability:** [in stock/out of stock]

## Description
[product description]

## Specifications
| Spec | Value |
|------|-------|
| ... | ... |

Output Formats

Markdown (default)

Clean, readable markdown with proper headings, lists, and formatting.

JSON

{
  "url": "https://...",
  "title": "Page Title",
  "type": "article|product|docs|list",
  "content": {
    "main": "...",
    "metadata": {}
  },
  "extracted_at": "ISO timestamp"
}
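
Assembling that JSON shape is mechanical; a sketch (the `buildJsonResult` name is an assumption):

```javascript
// Hypothetical sketch: assemble the JSON output object shown above.
function buildJsonResult(url, title, type, main, metadata = {}) {
  return {
    url,
    title,
    type,
    content: { main, metadata },
    extracted_at: new Date().toISOString(), // ISO timestamp
  };
}
```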

Text

Plain text with minimal formatting, suitable for further processing.


Error Handling

Navigation Errors

| Error | Detection | Action |
|-------|-----------|--------|
| Timeout | Page doesn't load in 30s | Report error, suggest retry |
| 404 Not Found | "404" in title/content | Report "Page not found" |
| 403 Forbidden | "403", "Access Denied" | Report access restriction |
| CAPTCHA | "captcha", "verify you're human" | Report CAPTCHA detected, cannot proceed |
| Paywall | "subscribe", "premium content" | Extract visible content, note paywall |

Recovery Actions

If page load fails:
1. Report the specific error to user
2. Suggest: "Try again?" or "Different URL?"
3. Close browser cleanly

If content is blocked:
1. Report what was detected (CAPTCHA/paywall/geo-block)
2. Extract any available preview content
3. Suggest alternatives if applicable

Advanced Scenarios

Single Page Applications (SPA)

1. Navigate to URL
2. Wait longer (3-5 seconds) for JS hydration
3. Use browser_wait_for with specific text if known
4. Then snapshot
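
Waiting on specific text can be sketched as a polling loop; this stands in for `browser_wait_for` and is an assumption, not that tool's implementation:

```javascript
// Hypothetical sketch: poll a content getter until expected text appears or a timeout hits.
async function waitForText(getContent, text, { timeoutMs = 5000, intervalMs = 100 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() <= deadline) {
    if ((await getContent()).includes(text)) return true;
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  return false; // timed out; caller should report a load error
}
```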

Infinite Scroll Pages

1. Navigate
2. Execute scroll loop (see Phase 1)
3. Snapshot after scrolling completes

Pages with Click-to-Reveal Content

1. Snapshot first to identify clickable elements
2. Use browser_click on "Read more" / "Show all" buttons
3. Wait briefly
4. Snapshot again for full content
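
Step 1 above could be sketched as a scan of the snapshot for reveal-style controls; the patterns and function name are assumptions, and the matched lines would then be targeted with `browser_click`:

```javascript
// Hypothetical sketch: pick "Read more" / "Show all" style controls out of snapshot lines.
const REVEAL_PATTERNS = [/read more/i, /show (all|more)/i, /load more/i, /expand/i];

function findRevealButtons(snapshotLines) {
  return snapshotLines.filter(line =>
    /button|link/.test(line) && REVEAL_PATTERNS.some(p => p.test(line))
  );
}
```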

Multi-page Articles

1. Scrape first page
2. Identify "Next" or pagination links
3. Ask user: "Article has X pages. Scrape all?"
4. If yes, iterate through pages and combine
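
Step 2 (spotting the pagination link) might look like this sketch over extracted (text, href) pairs; the `findNextLink` name and the link-text patterns are assumptions:

```javascript
// Hypothetical sketch: find a pagination "Next" link among { text, href } pairs.
function findNextLink(links) {
  const next = links.find(({ text }) => /^(next|older|»|→)/i.test(text.trim()));
  return next ? next.href : null; // null: no further pages detected
}
```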

Performance Guidelines

| Metric | Target | How |
|--------|--------|-----|
| Speed | < 15 seconds | Minimal waits, parallel where possible |
| Token Usage | < 5,000 tokens | Smart extraction, not full DOM |
| Reliability | > 95% success | Proper error handling |

Security Notes

  • Never execute arbitrary JavaScript from the page
  • Don't follow redirects to suspicious domains
  • Don't submit forms or click login buttons
  • Don't scrape pages that require authentication (unless user provides credentials flow)
  • Respect robots.txt when mentioned by user

Quick Reference

Minimum viable scrape (4 tool calls):

1. browser_navigate → 2. browser_wait_for → 3. browser_snapshot → 4. browser_close

Full-featured scrape (with scroll + screenshot):

1. browser_navigate
2. browser_wait_for
3. browser_evaluate (scroll)
4. browser_snapshot
5. browser_take_screenshot
6. browser_close

Remember: The goal is to deliver clean, useful content to the user, not raw HTML/DOM dumps.

Source

https://github.com/aiskillstore/marketplace/blob/main/skills/21pounder/web-scrape/SKILL.md

Overview

This web-scrape skill navigates to a target URL, optionally scrolls to load dynamic content, captures page content, and applies smart extraction to produce structured output. It supports markdown, JSON, or plain text formats, with optional screenshots and robust error handling.

How This Skill Works

Using Playwright, it navigates to the target URL, waits for rendering, and scrolls when requested to trigger lazy loading. It then captures a full content snapshot (with an optional screenshot), runs smart extraction to remove noise and structure the output based on content type, and formats the result in the chosen output format.

When to Use It

  • You need the main article body from a blog or news page.
  • You want product title, price, description, and specs from an e-commerce page.
  • You need to preserve code blocks and headings from documentation pages.
  • You must extract structured lists from search results or listings.
  • You want to summarize key messaging from a landing page with CTAs.

Quick Start

  1. Run /web-scrape <url> [options]
  2. Choose --format and optional flags like --scroll, --screenshot, or --full
  3. Retrieve and review the output in the chosen format

Best Practices

  • Provide a stable URL that’s publicly accessible.
  • Choose the correct format (--format) and options (--scroll, --screenshot, --full) for the target page.
  • Enable --scroll for pages with infinite scrolling to load content.
  • Review and validate the extracted structure against the page type (article/product/docs/list/landing).
  • Rely on the noise-removal rules (menus, footers, sidebars, cookie banners, etc.) and consider --full only if you need full page content.

Example Use Cases

  • Extract a blog article's title, author, date, and main body.
  • Capture product name, price, description, and specs from a shopping page.
  • Preserve code blocks and headings from a software documentation page.
  • Pull a list of items with titles and summaries from a category or search results page.
  • Summarize the hero messaging from a marketing landing page.
