extract-and-export
npx machina-cli add skill Crawlio-app/crawlio-plugin/extract-and-export --openclaw
Complete crawl-extract-export pipeline. Crawls a site, extracts structured content (clean HTML, markdown, metadata, asset manifests), and exports in any of 7 formats.
When to Use
Use this skill when the user wants to download a site AND get usable output — not just a raw crawl, but extracted content ready for consumption, archival, or deployment.
For crawl-only workflows (no extraction or export), use crawl-site instead.
Arguments
- `$0` (required): The URL to crawl
- `$1` (optional): Maximum crawl depth (default: 3)
- `$2` (optional): Export format (default: `folder`)
Export Formats
| Format | Description |
|---|---|
| `folder` | Mirror on disk with original directory structure |
| `zip` | Compressed archive, ready to share |
| `singleHTML` | All assets inlined into a single HTML file |
| `warc` | ISO 28500 web archive standard |
| `pdf` | Rendered pages as a portable document |
| `extracted` | Structured data only: clean HTML, markdown, metadata, no raw assets |
| `deploy` | Production-ready bundle with `crawl-manifest.json` |
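The Tips section later in this doc recommends a format per use case (`warc` for archival, `extracted` for AI consumption, `zip` for sharing, `deploy` for deployment). That mapping can be captured in a small lookup; the helper and its goal labels are illustrative, not part of the skill, while the format names are taken from the table above:

```python
def pick_format(goal: str) -> str:
    # Hypothetical convenience mapping a downstream goal to the
    # export format this doc recommends for it.
    recommended = {
        "archival": "warc",       # ISO 28500, preserves full HTTP headers
        "ai": "extracted",        # structured data only, no raw assets
        "sharing": "zip",         # compressed and portable
        "deployment": "deploy",   # includes crawl-manifest.json
    }
    return recommended.get(goal, "folder")  # "folder" is the skill default
```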
Workflow
1. Configure Settings
```
update_settings({
  settings: {
    maxConcurrent: 4,
    crawlDelay: 0.5,
    stripTrackingParams: true
  },
  policy: {
    scopeMode: "sameDomain",
    maxDepth: $1 or 3,
    respectRobotsTxt: true,
    includeSupportingFiles: true,
    downloadCrossDomainAssets: true,
    autoUpgradeHTTP: true
  }
})
```
Adjust based on site size:

- Small site (<100 pages): `maxDepth: 10`, `maxConcurrent: 8`
- Medium site (100-1000 pages): `maxDepth: 5`, `maxConcurrent: 4`
- Large site (1000+ pages): `maxDepth: 3`, `maxPagesPerCrawl: 500`
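The sizing guidance above can be sketched as a function. The thresholds come from this doc; the helper itself is hypothetical:

```python
def crawl_policy(expected_pages: int) -> dict:
    # Pick crawl limits from the site-size guidance in this doc.
    if expected_pages < 100:       # small site
        return {"maxDepth": 10, "maxConcurrent": 8}
    if expected_pages <= 1000:     # medium site
        return {"maxDepth": 5, "maxConcurrent": 4}
    return {"maxDepth": 3, "maxPagesPerCrawl": 500}  # large site
```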
2. Start the Crawl
```
start_crawl({ url: "$0" })
```
3. Monitor Progress
Poll `get_crawl_status` with the `since` parameter for efficient change detection:

```
get_crawl_status()
// Returns: seq: 42, downloaded: 85/150

get_crawl_status({ since: 42 })
// Returns: "No changes" or updated status
```
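The seq-based polling pattern above can be sketched as a helper. Here `get_status(since=...)` stands in for the `get_crawl_status` tool; the return shape (None when nothing changed, otherwise a dict with `seq` and `done` keys) is an assumption of this sketch, not documented behavior:

```python
import time

def wait_for_crawl(get_status, poll_interval=2.0):
    """Poll until the crawl reports completion, tracking the last
    seen sequence number so unchanged polls are cheap."""
    last_seq = 0
    while True:
        status = get_status(since=last_seq)
        if status is not None:       # something changed since last_seq
            last_seq = status["seq"]
            if status.get("done"):
                return status
        time.sleep(poll_interval)
```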
4. Check for Issues
After crawl completes:
```
get_failed_urls()  // Any failures to retry?
get_errors()       // Any engine errors?
```

Retry transient failures:

```
recrawl_urls({ urls: ["https://example.com/failed-page"] })
```
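Before calling `recrawl_urls`, it helps to separate transient failures (worth retrying) from permanent ones. The filter below is a sketch; the record shape (`{"url", "status"}`) is an assumed simplification of what `get_failed_urls` returns:

```python
# HTTP statuses that usually indicate a transient, retryable failure.
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}

def retry_candidates(failed_records):
    # Keep only URLs whose failure looks transient; 404s and other
    # permanent errors are not worth a recrawl.
    return [r["url"] for r in failed_records
            if r.get("status") in TRANSIENT_STATUSES]
```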
5. Review What Was Downloaded
```
get_site_tree()   // File structure overview
get_downloads()   // Detailed download info with content types
```
6. Extract Content
```
extract_site()
```
This runs the extraction pipeline and produces per-page artifacts:
- Clean HTML (tracking scripts removed)
- Markdown conversion
- Metadata (title, description, headings, links)
- Asset manifests
Poll `get_extraction_status` if the extraction takes time.
7. Export
```
export_site({ format: "$2" or "folder" })
```
Poll `get_export_status` for large exports.
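Polling for export completion can be sketched with a timeout. Here `get_export_status` stands in for the tool of the same name; the assumption that it returns a dict with a `"state"` key that reads `"running"` until the export finishes belongs to this sketch, not to the documented API:

```python
import time

def wait_for_export(get_export_status, timeout=600.0, interval=2.0):
    # Poll until the export leaves the "running" state or the
    # timeout elapses, whichever comes first.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_export_status()
        if status["state"] != "running":
            return status
        time.sleep(interval)
    raise TimeoutError("export did not finish before timeout")
```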
8. Report Results
Summarize:
- Crawl: Total pages discovered, downloaded, failed
- Extraction: Pages processed, artifacts created
- Export: Format, location, file size
- Issues: Any errors or notable findings
Tips
- For archival workflows, use `warc`: it is the ISO 28500 standard and preserves full HTTP headers
- For AI consumption, use `extracted`: just the structured data, no raw assets
- For sharing, use `zip`: compressed and portable
- For deployment, use `deploy`: includes `crawl-manifest.json` with full metadata
- For large sites, set `maxPagesPerCrawl` to avoid runaway crawls
- Save the project after export for future reference: `save_project({ name: "example.com export" })`
Source
https://github.com/Crawlio-app/crawlio-plugin/blob/main/skills/extract-and-export/SKILL.md

Overview
extract-and-export crawls a site, extracts structured content (clean HTML, markdown, metadata, asset manifests), and exports in seven formats. It yields artifacts ready for archival, deployment, or AI consumption.
How This Skill Works
Configure the crawl and export settings with `update_settings`, then start the crawl with `start_crawl({ url: 'your-site-url' })`. After crawling, run `extract_site()` to generate per-page artifacts (clean HTML, Markdown, metadata, asset manifests), then `export_site({ format: 'folder' })` to produce the chosen output. Monitor progress with `get_crawl_status` and check results with `get_site_tree`, `get_downloads`, and `get_extraction_status`.
When to Use It
- You want to download a site and export usable artifacts (folder, zip, PDF, etc.) for archival, deployment, or distribution.
- You need a complete crawl-extract-export pipeline rather than a raw crawl.
- You require `warc` or another archival format (`pdf`, for example) for compliance or offline access.
- You want per-page structured data: clean HTML, Markdown, and metadata, with no raw assets.
- You need a production-ready bundle with a `crawl-manifest.json` for deployment.
Quick Start
- Step 1: Configure settings with `update_settings({ settings: { maxConcurrent: 4, crawlDelay: 0.5, stripTrackingParams: true }, policy: { scopeMode: "sameDomain", maxDepth: $1 or 3, respectRobotsTxt: true, includeSupportingFiles: true, downloadCrossDomainAssets: true, autoUpgradeHTTP: true } })`
- Step 2: Start the crawl with `start_crawl({ url: 'your-site-url' })`
- Step 3: Run extraction and export: `extract_site()` then `export_site({ format: 'folder' })`
Best Practices
- Set `maxDepth` and `maxConcurrent` based on the site's size to balance speed and resource use.
- Enable `respectRobotsTxt` and `stripTrackingParams` to stay compliant and lean.
- Choose the export format to match downstream use (`warc` for archival, `extracted` for AI, `deploy` for deployment).
- Review artifacts with `get_site_tree` and `get_downloads` before exporting to catch issues early.
- Save the project after export for future reference using `save_project({ name: "your-export" })`.
Example Use Cases
- Archive a small corporate site as `warc` to preserve HTTP headers for compliance.
- Share a blog by exporting to `zip` with all assets for offline reading.
- Produce `extracted` data (clean HTML, Markdown, metadata) for AI training and ingestion.
- Create a production-ready `deploy` bundle including `crawl-manifest.json` for staging deployment.
- Generate a quick `singleHTML` export to preview the site in a browser for stakeholders.