extract-and-export
npx machina-cli add skill Crawlio-app/crawlio-plugin/extract-and-export --openclaw
Complete crawl-extract-export pipeline. Crawls a site, extracts structured content (clean HTML, markdown, metadata, asset manifests), and exports in any of 7 formats.
When to Use
Use this skill when the user wants to download a site AND get usable output — not just a raw crawl, but extracted content ready for consumption, archival, or deployment.
For crawl-only workflows (no extraction or export), use crawl-site instead.
Arguments
- `$0` (required): The URL to crawl
- `$1` (optional): Maximum crawl depth (default: 3)
- `$2` (optional): Export format (default: `folder`)
Export Formats
| Format | Description |
|---|---|
| `folder` | Mirror on disk with original directory structure |
| `zip` | Compressed archive, ready to share |
| `singleHTML` | All assets inlined into a single HTML file |
| `warc` | ISO 28500 web archive standard |
| `pdf` | Rendered pages as a portable document |
| `extracted` | Structured data only: clean HTML, markdown, metadata, no raw assets |
| `deploy` | Production-ready bundle with `crawl-manifest.json` |
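The Tips section later in this doc recommends a format per use case (`warc` for archival, `extracted` for AI consumption, `zip` for sharing, `deploy` for deployment). That mapping can be captured in a small lookup; the helper and its goal labels are illustrative, not part of the skill, while the format names are taken from the table above:

```python
def pick_format(goal: str) -> str:
    # Hypothetical convenience mapping a downstream goal to the
    # export format this doc recommends for it.
    recommended = {
        "archival": "warc",       # ISO 28500, preserves full HTTP headers
        "ai": "extracted",        # structured data only, no raw assets
        "sharing": "zip",         # compressed and portable
        "deployment": "deploy",   # includes crawl-manifest.json
    }
    return recommended.get(goal, "folder")  # "folder" is the skill default
```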
Workflow
1. Configure Settings
```
update_settings({
  settings: {
    maxConcurrent: 4,
    crawlDelay: 0.5,
    stripTrackingParams: true
  },
  policy: {
    scopeMode: "sameDomain",
    maxDepth: $1 or 3,
    respectRobotsTxt: true,
    includeSupportingFiles: true,
    downloadCrossDomainAssets: true,
    autoUpgradeHTTP: true
  }
})
```
Adjust based on site size:

- Small site (<100 pages): `maxDepth: 10`, `maxConcurrent: 8`
- Medium site (100-1000 pages): `maxDepth: 5`, `maxConcurrent: 4`
- Large site (1000+ pages): `maxDepth: 3`, `maxPagesPerCrawl: 500`
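The sizing guidance above can be sketched as a function. The thresholds come from this doc; the helper itself is hypothetical:

```python
def crawl_policy(expected_pages: int) -> dict:
    # Pick crawl limits from the site-size guidance in this doc.
    if expected_pages < 100:       # small site
        return {"maxDepth": 10, "maxConcurrent": 8}
    if expected_pages <= 1000:     # medium site
        return {"maxDepth": 5, "maxConcurrent": 4}
    return {"maxDepth": 3, "maxPagesPerCrawl": 500}  # large site
```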
2. Start the Crawl
```
start_crawl({ url: "$0" })
```
3. Monitor Progress
Poll `get_crawl_status` with the `since` parameter for efficient change detection:

```
get_crawl_status()
// Returns: seq: 42, downloaded: 85/150

get_crawl_status({ since: 42 })
// Returns: "No changes" or updated status
```
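The seq-based polling pattern above can be sketched as a helper. Here `get_status(since=...)` stands in for the `get_crawl_status` tool; the return shape (None when nothing changed, otherwise a dict with `seq` and `done` keys) is an assumption of this sketch, not documented behavior:

```python
import time

def wait_for_crawl(get_status, poll_interval=2.0):
    """Poll until the crawl reports completion, tracking the last
    seen sequence number so unchanged polls are cheap."""
    last_seq = 0
    while True:
        status = get_status(since=last_seq)
        if status is not None:       # something changed since last_seq
            last_seq = status["seq"]
            if status.get("done"):
                return status
        time.sleep(poll_interval)
```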
4. Check for Issues
After crawl completes:
```
get_failed_urls()  // Any failures to retry?
get_errors()       // Any engine errors?
```

Retry transient failures:

```
recrawl_urls({ urls: ["https://example.com/failed-page"] })
```
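Before calling `recrawl_urls`, it helps to separate transient failures (worth retrying) from permanent ones. The filter below is a sketch; the record shape (`{"url", "status"}`) is an assumed simplification of what `get_failed_urls` returns:

```python
# HTTP statuses that usually indicate a transient, retryable failure.
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}

def retry_candidates(failed_records):
    # Keep only URLs whose failure looks transient; 404s and other
    # permanent errors are not worth a recrawl.
    return [r["url"] for r in failed_records
            if r.get("status") in TRANSIENT_STATUSES]
```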
5. Review What Was Downloaded
```
get_site_tree()   // File structure overview
get_downloads()   // Detailed download info with content types
```
6. Extract Content
```
extract_site()
```
This runs the extraction pipeline and produces per-page artifacts:
- Clean HTML (tracking scripts removed)
- Markdown conversion
- Metadata (title, description, headings, links)
- Asset manifests
Poll `get_extraction_status` if the extraction takes time.
7. Export
```
export_site({ format: "$2" or "folder" })
```
Poll `get_export_status` for large exports.
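Polling for export completion can be sketched with a timeout. Here `get_export_status` stands in for the tool of the same name; the assumption that it returns a dict with a `"state"` key that reads `"running"` until the export finishes belongs to this sketch, not to the documented API:

```python
import time

def wait_for_export(get_export_status, timeout=600.0, interval=2.0):
    # Poll until the export leaves the "running" state or the
    # timeout elapses, whichever comes first.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_export_status()
        if status["state"] != "running":
            return status
        time.sleep(interval)
    raise TimeoutError("export did not finish before timeout")
```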
8. Report Results
Summarize:
- Crawl: Total pages discovered, downloaded, failed
- Extraction: Pages processed, artifacts created
- Export: Format, location, file size
- Issues: Any errors or notable findings
Tips
- For archival workflows, use `warc`: it is the ISO 28500 standard and preserves full HTTP headers
- For AI consumption, use `extracted`: just the structured data, no raw assets
- For sharing, use `zip`: compressed and portable
- For deployment, use `deploy`: includes `crawl-manifest.json` with full metadata
- For large sites, set `maxPagesPerCrawl` to avoid runaway crawls
- Save the project after export for future reference: `save_project({ name: "example.com export" })`
Source
https://github.com/Crawlio-app/crawlio-plugin/blob/main/skills/extract-and-export/SKILL.md

Overview
extract-and-export crawls a site, extracts structured content (clean HTML, markdown, metadata, asset manifests), and exports in seven formats. It yields artifacts ready for archival, deployment, or AI consumption.
How This Skill Works
Configure the crawl and export settings with `update_settings`, then start the crawl with `start_crawl({ url: 'your-site-url' })`. After crawling, run `extract_site()` to generate per-page artifacts (clean HTML, Markdown, metadata, asset manifests), then `export_site({ format: 'folder' })` to produce the chosen output. Monitor progress with `get_crawl_status` and check results with `get_site_tree`, `get_downloads`, and `get_extraction_status`.
When to Use It
- You want to download a site and export usable artifacts (folder, zip, PDF, etc.) for archival, deployment, or distribution.
- You need a complete crawl-extract-export pipeline rather than a raw crawl.
- You require `warc` or another archival format (`pdf`, for example) for compliance or offline access.
- You want per-page structured data: clean HTML, Markdown, and metadata, with no raw assets.
- You need a production-ready bundle with a `crawl-manifest.json` for deployment.
Quick Start
- Step 1: Configure settings with `update_settings({ settings: { maxConcurrent: 4, crawlDelay: 0.5, stripTrackingParams: true }, policy: { scopeMode: "sameDomain", maxDepth: $1 or 3, respectRobotsTxt: true, includeSupportingFiles: true, downloadCrossDomainAssets: true, autoUpgradeHTTP: true } })`
- Step 2: Start the crawl with `start_crawl({ url: 'your-site-url' })`
- Step 3: Run extraction and export: `extract_site()` then `export_site({ format: 'folder' })`
Best Practices
- Set `maxDepth` and `maxConcurrent` based on the site's size to balance speed and resource use.
- Enable `respectRobotsTxt` and `stripTrackingParams` to stay compliant and lean.
- Choose the export format to match downstream use (`warc` for archival, `extracted` for AI, `deploy` for deployment).
- Review artifacts with `get_site_tree` and `get_downloads` before exporting to catch issues early.
- Save the project after export for future reference using `save_project({ name: "your-export" })`.
Example Use Cases
- Archive a small corporate site as `warc` to preserve HTTP headers for compliance.
- Share a blog by exporting to `zip` with all assets for offline reading.
- Produce `extracted` data (clean HTML, Markdown, metadata) for AI training and ingestion.
- Create a production-ready `deploy` bundle including `crawl-manifest.json` for staging deployment.
- Generate a quick `singleHTML` export to preview the site in a browser for stakeholders.