crawl-site

npx machina-cli add skill Crawlio-app/crawlio-plugin/crawl-site --openclaw
Files (1)
SKILL.md
3.2 KB

crawl-site

Crawl a website using Crawlio. Configures settings based on site type, starts the crawl, monitors progress, and reports results.

When to Use

Use this skill when the user wants to download, mirror, or crawl a website for offline access, analysis, or archival.

Workflow

1. Determine Site Type

Before configuring settings, identify the site type. Ask the user or infer from context:

Site Type | Indicators | Recommended Settings
Static site | HTML/CSS, no JS frameworks | maxDepth: 5, maxConcurrent: 8
SPA (React, Vue, etc.) | JS-heavy, client-side routing | maxDepth: 3, includeSupportingFiles: true; consider using crawlio-agent for enrichment first
CMS (WordPress, etc.) | /wp-content/, admin paths | maxDepth: 5, excludePatterns: ["/wp-admin/*", "/wp-json/*"]
Documentation site | /docs/, versioned paths | maxDepth: 10, excludePatterns: ["/v[0-9]*/*"] for old versions
Single-page snapshot | User wants just one page | maxDepth: 0, includeSupportingFiles: true
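
The presets above can be captured as a small lookup. This is a sketch only: the type keys and the static-site fallback are assumptions, while the values mirror the table.

```javascript
// Map each site type from the table to its recommended policy fields.
// The type keys and the fallback choice are assumptions for this sketch.
function settingsForSiteType(type) {
  const presets = {
    static:   { maxDepth: 5, maxConcurrent: 8 },
    spa:      { maxDepth: 3, includeSupportingFiles: true },
    cms:      { maxDepth: 5, excludePatterns: ["/wp-admin/*", "/wp-json/*"] },
    docs:     { maxDepth: 10, excludePatterns: ["/v[0-9]*/*"] },
    snapshot: { maxDepth: 0, includeSupportingFiles: true },
  };
  // Unknown types fall back to the static-site preset.
  return presets[type] ?? presets.static;
}
```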

2. Configure Settings

Use update_settings to set appropriate configuration:

update_settings({
  settings: {
    maxConcurrent: 4,      // Parallel downloads (increase for large sites)
    crawlDelay: 0.5,       // Be polite — seconds between requests
    timeout: 60,           // Request timeout in seconds
    stripTrackingParams: true
  },
  policy: {
    scopeMode: "sameDomain",
    maxDepth: 5,
    respectRobotsTxt: true,
    includeSupportingFiles: true,
    downloadCrossDomainAssets: true,  // Get CDN assets
    autoUpgradeHTTP: true             // Use HTTPS
  }
})

3. Start the Crawl

start_crawl({ url: "https://example.com" })

For multi-page targeted downloads:

start_crawl({ urls: ["https://example.com/page1", "https://example.com/page2"] })

4. Monitor Progress

Poll get_crawl_status with the sequence number for efficient change detection:

get_crawl_status()
// Returns: seq: 42, downloaded: 85/150

get_crawl_status({ since: 42 })
// Returns: "No changes" or updated status
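
A polling loop around get_crawl_status might look like the sketch below. The status function is injected as a parameter because the real tool call belongs to the plugin, and the { changed, seq, downloaded, total } return shape is an assumption based on the examples above.

```javascript
// Poll a status function until the crawl finishes, passing the last-seen
// sequence number so unchanged polls are cheap. `getStatus` stands in for
// the plugin's get_crawl_status tool; its return shape is assumed here.
async function pollUntilDone(getStatus, { intervalMs = 1000 } = {}) {
  let since = 0;
  for (;;) {
    const status = await getStatus({ since });
    if (status.changed) {
      since = status.seq; // remember the newest sequence number
      if (status.downloaded >= status.total) return status;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```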

5. Check for Issues

After crawl completes:

get_failed_urls()     // Any failures to retry?
get_errors()          // Any engine errors?
get_site_tree()       // What was downloaded?

6. Retry Failures (if any)

recrawl_urls({ urls: ["https://example.com/failed-page"] })
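
Steps 5 and 6 can be combined into an automated retry pass. In this sketch the tool functions are passed in as parameters, since the real get_failed_urls / recrawl_urls calls are provided by the Crawlio plugin rather than plain JavaScript; the wrapper itself is hypothetical.

```javascript
// Retry failed URLs up to maxRetries times, then report what is still failing.
// `getFailedUrls` and `recrawlUrls` stand in for the plugin's tools.
async function retryFailures(getFailedUrls, recrawlUrls, { maxRetries = 1 } = {}) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const failed = await getFailedUrls();
    if (failed.length === 0) return [];
    await recrawlUrls({ urls: failed });
  }
  return getFailedUrls(); // whatever is still failing after the retries
}
```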

7. Report Results

Summarize: pages downloaded, failures, site structure, any notable findings.

Tips

  • For large sites (1000+ pages), set maxPagesPerCrawl to avoid runaway crawls
  • Use excludePatterns to skip known junk paths (admin panels, API routes, search results)
  • If a site requires authentication, set customCookies or customHeaders in settings
  • For SPA sites, combine with the crawlio-agent Chrome extension for framework detection and JavaScript-rendered content
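
One plausible reading of excludePatterns is simple "*" wildcard matching; Crawlio's actual matching rules may differ, so the translation below is an assumption for illustration.

```javascript
// Check a path against exclude patterns by translating "*" globs to regexes.
// Everything except "*" is treated literally (an assumption about the format).
function isExcluded(path, patterns) {
  const escapeRe = (s) => s.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return patterns.some((p) => {
    const re = new RegExp("^" + p.split("*").map(escapeRe).join(".*") + "$");
    return re.test(path);
  });
}
```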

Source

git clone https://github.com/Crawlio-app/crawlio-plugin.git

View on GitHub: https://github.com/Crawlio-app/crawlio-plugin/blob/main/skills/crawl-site/SKILL.md

Overview

Automates site-wide crawling with Crawlio to download, mirror, or analyze a website for offline access. It auto-configures settings by site type (static, SPA, CMS, docs, or single-page snapshot), initiates the crawl, tracks progress, and reports results.

How This Skill Works

First, identify the site type and apply recommended settings. Then call update_settings with parallelism, delays, timeouts, and a policy (scopeMode, maxDepth, robots, and asset options). Start the crawl with start_crawl (url or multiple urls), and monitor progress via get_crawl_status. After completion, inspect results with get_site_tree or fetch issues with get_failed_urls and get_errors, and retry via recrawl_urls if needed.

When to Use It

  • You want to download or mirror a static site for offline access or archival.
  • You want to crawl a SPA (React, Vue, etc.) to analyze client-rendered content.
  • You want to crawl a CMS site (e.g., WordPress) and exclude admin or API paths.
  • You want to crawl a documentation site with versioned paths for offline docs.
  • You want a quick single-page snapshot (maxDepth 0) for offline review.

Quick Start

  1. Determine the site type and call update_settings with an appropriate policy (maxDepth, includeSupportingFiles, and excludePatterns).
  2. Start the crawl: start_crawl({ url: "https://example.com" }) or start_crawl({ urls: ["https://example.com/page1", "https://example.com/page2"] }).
  3. Monitor progress with get_crawl_status and review results with get_site_tree or get_errors.
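
The quick-start flow can be sketched end to end as below. The tool calls (update_settings, start_crawl, get_crawl_status) are the plugin's tools, represented here as an injected `tools` object so the control flow can be shown in isolation; the return shape of get_crawl_status is an assumption.

```javascript
// End-to-end sketch: configure, start, then poll until all pages are downloaded.
// A real loop would sleep between polls; the stubbed status makes that unnecessary here.
async function crawlSite(tools, url, policy) {
  await tools.update_settings({ settings: { maxConcurrent: 4 }, policy });
  await tools.start_crawl({ url });
  let since = 0;
  let status;
  do {
    status = await tools.get_crawl_status({ since });
    since = status.seq;
  } while (status.downloaded < status.total);
  return status;
}
```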

Best Practices

  • Identify the site type first to apply appropriate maxDepth and include/exclude patterns.
  • Use includeSupportingFiles for SPAs and single-page snapshots so that assets and metadata are captured.
  • Skip junk paths with excludePatterns like /wp-admin/* and /wp-json/*.
  • Respect robots.txt and tune crawlDelay, timeout, and maxConcurrent for site load.
  • For large sites, set maxPagesPerCrawl to cap the crawl, and retry failed URLs with recrawl_urls.

Example Use Cases

  • Mirror a static corporate site for offline archival and internal demos.
  • Crawl a React SPA to collect all client-rendered routes and assets.
  • Back up a WordPress site while excluding admin and JSON endpoints.
  • Download a versioned documentation site, skipping old versions.
  • Capture a single page snapshot for offline review.
