# crawl-site

```bash
npx machina-cli add skill Crawlio-app/crawlio-plugin/crawl-site --openclaw
```
Crawl a website using Crawlio. Configures settings based on site type, starts the crawl, monitors progress, and reports results.
## When to Use
Use this skill when the user wants to download, mirror, or crawl a website for offline access, analysis, or archival.
## Workflow

### 1. Determine Site Type
Before configuring settings, identify the site type. Ask the user or infer from context:
| Site Type | Indicators | Recommended Settings |
|---|---|---|
| Static site | HTML/CSS, no JS frameworks | maxDepth: 5, maxConcurrent: 8 |
| SPA (React, Vue, etc.) | JS-heavy, client-side routing | maxDepth: 3, includeSupportingFiles: true, consider using crawlio-agent for enrichment first |
| CMS (WordPress, etc.) | /wp-content/, admin paths | maxDepth: 5, excludePatterns: ["/wp-admin/*", "/wp-json/*"] |
| Documentation site | /docs/, versioned paths | maxDepth: 10, excludePatterns: ["/v[0-9]*/*"] for old versions |
| Single page snapshot | User wants just one page | maxDepth: 0, includeSupportingFiles: true |
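For instance, the CMS row above might translate into the sketch below. Whether `excludePatterns` lives in the `policy` object is an assumption; the Tips section names the setting but not its placement.

```js
// Sketch: CMS (WordPress) configuration per the table's recommendations.
// The placement of excludePatterns inside policy is an assumption.
update_settings({
  policy: {
    scopeMode: "sameDomain",
    maxDepth: 5,
    respectRobotsTxt: true,
    excludePatterns: ["/wp-admin/*", "/wp-json/*"] // skip admin and API routes
  }
})
```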
### 2. Configure Settings

Use `update_settings` to set an appropriate configuration:
```js
update_settings({
  settings: {
    maxConcurrent: 4,          // Parallel downloads (increase for large sites)
    crawlDelay: 0.5,           // Be polite: seconds between requests
    timeout: 60,               // Request timeout
    stripTrackingParams: true
  },
  policy: {
    scopeMode: "sameDomain",
    maxDepth: 5,
    respectRobotsTxt: true,
    includeSupportingFiles: true,
    downloadCrossDomainAssets: true, // Get CDN assets
    autoUpgradeHTTP: true            // Use HTTPS
  }
})
```
### 3. Start the Crawl

```js
start_crawl({ url: "https://example.com" })
```

For multi-page targeted downloads:

```js
start_crawl({ urls: ["https://example.com/page1", "https://example.com/page2"] })
```
### 4. Monitor Progress

Poll `get_crawl_status` with the sequence number for efficient change detection:

```js
get_crawl_status()
// Returns: seq: 42, downloaded: 85/150

get_crawl_status({ since: 42 })
// Returns: "No changes" or updated status
```
### 5. Check for Issues

After the crawl completes:

```js
get_failed_urls() // Any failures to retry?
get_errors()      // Any engine errors?
get_site_tree()   // What was downloaded?
```
### 6. Retry Failures (if any)

```js
recrawl_urls({ urls: ["https://example.com/failed-page"] })
```
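Combining the failure check with the retry might look like this sketch; it assumes `get_failed_urls` returns an array of URL strings, which the document does not specify.

```js
// Retry every failed URL in one pass. The return shape of
// get_failed_urls (an array of URL strings) is an assumption.
const failed = await get_failed_urls();
if (failed.length > 0) {
  recrawl_urls({ urls: failed });
}
```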
### 7. Report Results
Summarize: pages downloaded, failures, site structure, any notable findings.
## Tips

- For large sites (1000+ pages), set `maxPagesPerCrawl` to avoid runaway crawls
- Use `excludePatterns` to skip known junk paths (admin panels, API routes, search results)
- If a site requires authentication, set `customCookies` or `customHeaders` in settings (see the sketch after this list)
- For SPA sites, combine with the crawlio-agent Chrome extension for framework detection and JavaScript-rendered content
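An authenticated crawl might be configured like the sketch below. The `customCookies` and `customHeaders` keys come from the tip above, but the value shapes shown are assumptions.

```js
// Hypothetical authenticated-crawl settings. Key names come from the tip
// above; the header-object and cookie-string shapes are assumptions.
update_settings({
  settings: {
    customHeaders: { Authorization: "Bearer <token>" },
    customCookies: "session=abc123"
  }
})
```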
## Source

https://github.com/Crawlio-app/crawlio-plugin/blob/main/skills/crawl-site/SKILL.md

## Overview
Automates site-wide crawling with Crawlio to download, mirror, or analyze a website for offline access. It auto-configures settings by site type (static, SPA, CMS, docs, or single-page snapshot), initiates the crawl, tracks progress, and reports results.
## How This Skill Works

First, identify the site type and apply the recommended settings. Then call `update_settings` with parallelism, delays, timeouts, and a policy (`scopeMode`, `maxDepth`, robots, and asset options). Start the crawl with `start_crawl` (one `url` or multiple `urls`) and monitor progress via `get_crawl_status`. After completion, inspect results with `get_site_tree`, fetch issues with `get_failed_urls` and `get_errors`, and retry via `recrawl_urls` if needed.
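Strung together, the whole flow might look like this compact sketch. The tool signatures follow the examples elsewhere in this document; the completion check and the return shape of `get_failed_urls` are assumptions.

```js
// End-to-end sketch of the workflow: configure, crawl, check, retry, inspect.
update_settings({
  settings: { maxConcurrent: 4, crawlDelay: 0.5, timeout: 60 },
  policy: { scopeMode: "sameDomain", maxDepth: 5, respectRobotsTxt: true }
});
start_crawl({ url: "https://example.com" });
// ...poll get_crawl_status() until the crawl completes (see step 4), then:
const failed = await get_failed_urls(); // assumed to return URL strings
if (failed.length > 0) recrawl_urls({ urls: failed });
get_site_tree(); // inspect what was downloaded
```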
## When to Use It
- You want to download or mirror a static site for offline access or archival.
- You want to crawl a SPA (React, Vue, etc.) to analyze client-rendered content.
- You want to crawl a CMS site (e.g., WordPress) and exclude admin or API paths.
- You want to crawl a documentation site with versioned paths for offline docs.
- You want a quick single-page snapshot (maxDepth 0) for offline review (see the sketch below).
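The single-page case might look like the sketch below; the `maxDepth: 0` and `includeSupportingFiles` values come from the site-type table, while the URL is a hypothetical stand-in.

```js
// Snapshot one page with its assets: depth 0, keep supporting files.
update_settings({ policy: { maxDepth: 0, includeSupportingFiles: true } });
start_crawl({ url: "https://example.com/pricing" }); // hypothetical URL
```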
## Quick Start
- Step 1: Determine site type and call update_settings with appropriate policy for maxDepth, includeSupportingFiles, and excludePatterns.
- Step 2: Start the crawl: start_crawl({ url: "https://example.com" }) or start_crawl({ urls: ["https://example.com/page1", "https://example.com/page2"] }).
- Step 3: Monitor progress with get_crawl_status and review results with get_site_tree or get_errors.
## Best Practices
- Identify the site type first to apply appropriate maxDepth and include/exclude patterns.
- Use includeSupportingFiles for SPAs and single-page snapshots so that assets and metadata are captured.
- Skip junk paths with excludePatterns like /wp-admin/* and /wp-json/*.
- Respect robots.txt and tune crawlDelay, timeout, and maxConcurrent for site load.
- For large sites, set maxPagesPerCrawl to avoid runaway crawls (see the sketch after this list), and retry failed URLs with recrawl_urls once the crawl completes.
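Capping a large crawl might look like this one-liner; the Tips section names `maxPagesPerCrawl`, but its placement under `settings` and the value shown are assumptions.

```js
// Cap a runaway crawl. Placement under settings and the limit are assumptions.
update_settings({ settings: { maxPagesPerCrawl: 1000 } });
```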
## Example Use Cases
- Mirror a static corporate site for offline archival and internal demos.
- Crawl a React SPA to collect all client-rendered routes and assets.
- Back up a WordPress site while excluding admin and JSON endpoints.
- Download a versioned documentation site, skipping old versions.
- Capture a single page snapshot for offline review.