# crawl-site

```bash
npx machina-cli add skill Crawlio-app/crawlio-plugin/crawl-site --openclaw
```
Crawl a website using Crawlio. Configures settings based on site type, starts the crawl, monitors progress, and reports results.
## When to Use
Use this skill when the user wants to download, mirror, or crawl a website for offline access, analysis, or archival.
## Workflow

### 1. Determine Site Type
Before configuring settings, identify the site type. Ask the user or infer from context:
| Site Type | Indicators | Recommended Settings |
|---|---|---|
| Static site | HTML/CSS, no JS frameworks | maxDepth: 5, maxConcurrent: 8 |
| SPA (React, Vue, etc.) | JS-heavy, client-side routing | maxDepth: 3, includeSupportingFiles: true, consider using crawlio-agent for enrichment first |
| CMS (WordPress, etc.) | /wp-content/, admin paths | maxDepth: 5, excludePatterns: ["/wp-admin/*", "/wp-json/*"] |
| Documentation site | /docs/, versioned paths | maxDepth: 10, excludePatterns: ["/v[0-9]*/*"] for old versions |
| Single page snapshot | User wants just one page | maxDepth: 0, includeSupportingFiles: true |
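For instance, the CMS row above might translate into the sketch below. Whether `excludePatterns` lives in the `policy` object is an assumption; the Tips section names the setting but not its placement.

```js
// Sketch: CMS (WordPress) configuration per the table's recommendations.
// The placement of excludePatterns inside policy is an assumption.
update_settings({
  policy: {
    scopeMode: "sameDomain",
    maxDepth: 5,
    respectRobotsTxt: true,
    excludePatterns: ["/wp-admin/*", "/wp-json/*"] // skip admin and API routes
  }
})
```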
### 2. Configure Settings

Use `update_settings` to set an appropriate configuration:
```js
update_settings({
  settings: {
    maxConcurrent: 4,          // Parallel downloads (increase for large sites)
    crawlDelay: 0.5,           // Be polite: seconds between requests
    timeout: 60,               // Request timeout
    stripTrackingParams: true
  },
  policy: {
    scopeMode: "sameDomain",
    maxDepth: 5,
    respectRobotsTxt: true,
    includeSupportingFiles: true,
    downloadCrossDomainAssets: true, // Get CDN assets
    autoUpgradeHTTP: true            // Use HTTPS
  }
})
```
### 3. Start the Crawl

```js
start_crawl({ url: "https://example.com" })
```

For multi-page targeted downloads:

```js
start_crawl({ urls: ["https://example.com/page1", "https://example.com/page2"] })
```
### 4. Monitor Progress

Poll `get_crawl_status` with the sequence number for efficient change detection:

```js
get_crawl_status()
// Returns: seq: 42, downloaded: 85/150

get_crawl_status({ since: 42 })
// Returns: "No changes" or updated status
```
### 5. Check for Issues

After the crawl completes:

```js
get_failed_urls() // Any failures to retry?
get_errors()      // Any engine errors?
get_site_tree()   // What was downloaded?
```
### 6. Retry Failures (if any)

```js
recrawl_urls({ urls: ["https://example.com/failed-page"] })
```
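Combining the failure check with the retry might look like this sketch; it assumes `get_failed_urls` returns an array of URL strings, which the document does not specify.

```js
// Retry every failed URL in one pass. The return shape of
// get_failed_urls (an array of URL strings) is an assumption.
const failed = await get_failed_urls();
if (failed.length > 0) {
  recrawl_urls({ urls: failed });
}
```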
### 7. Report Results
Summarize: pages downloaded, failures, site structure, any notable findings.
## Tips

- For large sites (1000+ pages), set `maxPagesPerCrawl` to avoid runaway crawls
- Use `excludePatterns` to skip known junk paths (admin panels, API routes, search results)
- If a site requires authentication, set `customCookies` or `customHeaders` in settings (see the sketch after this list)
- For SPA sites, combine with the crawlio-agent Chrome extension for framework detection and JavaScript-rendered content
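An authenticated crawl might be configured like the sketch below. The `customCookies` and `customHeaders` keys come from the tip above, but the value shapes shown are assumptions.

```js
// Hypothetical authenticated-crawl settings. Key names come from the tip
// above; the header-object and cookie-string shapes are assumptions.
update_settings({
  settings: {
    customHeaders: { Authorization: "Bearer <token>" },
    customCookies: "session=abc123"
  }
})
```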
## Source

https://github.com/Crawlio-app/crawlio-plugin/blob/main/skills/crawl-site/SKILL.md

## Overview
Automates site-wide crawling with Crawlio to download, mirror, or analyze a website for offline access. It auto-configures settings by site type (static, SPA, CMS, docs, or single-page snapshot), initiates the crawl, tracks progress, and reports results.
## How This Skill Works

First, identify the site type and apply the recommended settings. Then call `update_settings` with parallelism, delays, timeouts, and a policy (`scopeMode`, `maxDepth`, robots, and asset options). Start the crawl with `start_crawl` (one `url` or multiple `urls`) and monitor progress via `get_crawl_status`. After completion, inspect results with `get_site_tree`, fetch issues with `get_failed_urls` and `get_errors`, and retry via `recrawl_urls` if needed.
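Strung together, the whole flow might look like this compact sketch. The tool signatures follow the examples elsewhere in this document; the completion check and the return shape of `get_failed_urls` are assumptions.

```js
// End-to-end sketch of the workflow: configure, crawl, check, retry, inspect.
update_settings({
  settings: { maxConcurrent: 4, crawlDelay: 0.5, timeout: 60 },
  policy: { scopeMode: "sameDomain", maxDepth: 5, respectRobotsTxt: true }
});
start_crawl({ url: "https://example.com" });
// ...poll get_crawl_status() until the crawl completes (see step 4), then:
const failed = await get_failed_urls(); // assumed to return URL strings
if (failed.length > 0) recrawl_urls({ urls: failed });
get_site_tree(); // inspect what was downloaded
```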
## When to Use It
- You want to download or mirror a static site for offline access or archival.
- You want to crawl a SPA (React, Vue, etc.) to analyze client-rendered content.
- You want to crawl a CMS site (e.g., WordPress) and exclude admin or API paths.
- You want to crawl a documentation site with versioned paths for offline docs.
- You want a quick single-page snapshot (maxDepth 0) for offline review (see the sketch below).
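The single-page case might look like the sketch below; the `maxDepth: 0` and `includeSupportingFiles` values come from the site-type table, while the URL is a hypothetical stand-in.

```js
// Snapshot one page with its assets: depth 0, keep supporting files.
update_settings({ policy: { maxDepth: 0, includeSupportingFiles: true } });
start_crawl({ url: "https://example.com/pricing" }); // hypothetical URL
```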
## Quick Start
- Step 1: Determine site type and call update_settings with appropriate policy for maxDepth, includeSupportingFiles, and excludePatterns.
- Step 2: Start the crawl: start_crawl({ url: "https://example.com" }) or start_crawl({ urls: ["https://example.com/page1", "https://example.com/page2"] }).
- Step 3: Monitor progress with get_crawl_status and review results with get_site_tree or get_errors.
## Best Practices
- Identify the site type first to apply appropriate maxDepth and include/exclude patterns.
- Use includeSupportingFiles for SPAs and single-page snapshots so that assets and metadata are captured.
- Skip junk paths with excludePatterns like /wp-admin/* and /wp-json/*.
- Respect robots.txt and tune crawlDelay, timeout, and maxConcurrent for site load.
- For large sites, set maxPagesPerCrawl to avoid runaway crawls (see the sketch after this list), and retry failed URLs with recrawl_urls once the crawl completes.
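Capping a large crawl might look like this one-liner; the Tips section names `maxPagesPerCrawl`, but its placement under `settings` and the value shown are assumptions.

```js
// Cap a runaway crawl. Placement under settings and the limit are assumptions.
update_settings({ settings: { maxPagesPerCrawl: 1000 } });
```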
## Example Use Cases
- Mirror a static corporate site for offline archival and internal demos.
- Crawl a React SPA to collect all client-rendered routes and assets.
- Back up a WordPress site while excluding admin and JSON endpoints.
- Download a versioned documentation site, skipping old versions.
- Capture a single page snapshot for offline review.