
extract-and-export

npx machina-cli add skill Crawlio-app/crawlio-plugin/extract-and-export --openclaw
Files (1)
SKILL.md
3.8 KB

extract-and-export

Complete crawl-extract-export pipeline. Crawls a site, extracts structured content (clean HTML, markdown, metadata, asset manifests), and exports in any of 7 formats.

When to Use

Use this skill when the user wants to download a site AND get usable output — not just a raw crawl, but extracted content ready for consumption, archival, or deployment.

For crawl-only workflows (no extraction or export), use crawl-site instead.

Arguments

  • $0 (required): The URL to crawl
  • $1 (optional): Maximum crawl depth (default: 3)
  • $2 (optional): Export format (default: folder)

Export Formats

Format       Description
folder       Mirror on disk with original directory structure
zip          Compressed archive, ready to share
singleHTML   All assets inlined into a single HTML file
warc         ISO 28500 web archive standard
pdf          Rendered pages as portable document
extracted    Structured data only — clean HTML, markdown, metadata, no raw assets
deploy       Production-ready bundle with crawl-manifest.json

Workflow

1. Configure Settings

update_settings({
  settings: {
    maxConcurrent: 4,
    crawlDelay: 0.5,
    stripTrackingParams: true
  },
  policy: {
    scopeMode: "sameDomain",
    maxDepth: $1 or 3,
    respectRobotsTxt: true,
    includeSupportingFiles: true,
    downloadCrossDomainAssets: true,
    autoUpgradeHTTP: true
  }
})

Adjust based on site size:

  • Small site (<100 pages): maxDepth: 10, maxConcurrent: 8
  • Medium site (100-1000): maxDepth: 5, maxConcurrent: 4
  • Large site (1000+): maxDepth: 3, maxPagesPerCrawl: 500
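The sizing heuristics above can be sketched as a small helper. This is illustrative only — `pickCrawlSettings` and its thresholds are not part of the plugin API; pass its result into `update_settings` yourself:

```javascript
// Map an estimated page count to the crawl settings suggested above.
// Hypothetical helper -- not part of the Crawlio tool API.
function pickCrawlSettings(estimatedPages) {
  if (estimatedPages < 100) {
    return { maxDepth: 10, maxConcurrent: 8 };          // small site
  }
  if (estimatedPages <= 1000) {
    return { maxDepth: 5, maxConcurrent: 4 };           // medium site
  }
  return { maxDepth: 3, maxConcurrent: 4, maxPagesPerCrawl: 500 }; // large site
}
```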

2. Start the Crawl

start_crawl({ url: "$0" })

3. Monitor Progress

Poll get_crawl_status with the since parameter for efficient change detection:

get_crawl_status()
// Returns: seq: 42, downloaded: 85/150

get_crawl_status({ since: 42 })
// Returns: "No changes" or updated status

4. Check for Issues

After crawl completes:

get_failed_urls()     // Any failures to retry?
get_errors()          // Any engine errors?

Retry transient failures:

recrawl_urls({ urls: ["https://example.com/failed-page"] })
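Deciding which failures are worth passing to recrawl_urls can be sketched like this. The status-code heuristic and the failure-record shape are assumptions — adjust to whatever get_failed_urls actually returns:

```javascript
// Split failed URLs into retryable (transient) and permanent failures.
// Treats timeouts (no status), 5xx, and 429 as transient; other 4xx as permanent.
// Hypothetical helper over get_failed_urls() output -- shape is assumed.
function splitFailures(failures) {
  const transient = [];
  const permanent = [];
  for (const f of failures) {
    const retryable = f.status === undefined || f.status >= 500 || f.status === 429;
    (retryable ? transient : permanent).push(f.url);
  }
  return { transient, permanent };
}
```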

5. Review What Was Downloaded

get_site_tree()       // File structure overview
get_downloads()       // Detailed download info with content types

6. Extract Content

extract_site()

This runs the extraction pipeline and produces per-page artifacts:

  • Clean HTML (tracking scripts removed)
  • Markdown conversion
  • Metadata (title, description, headings, links)
  • Asset manifests

Poll get_extraction_status if the extraction takes time.

7. Export

export_site({ format: "$2" or "folder" })

Poll get_export_status for large exports.
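For large exports, the same polling idea works with exponential backoff; a sketch with the status call injected (`getStatus` stands in for get_export_status and its `done` field is an assumption):

```javascript
// Poll an injected status callback with exponential backoff, capped at maxMs.
// `getStatus` returns an object with a `done` flag (assumed shape).
async function pollWithBackoff(getStatus, startMs = 500, maxMs = 8000) {
  let wait = startMs;
  for (;;) {
    const status = await getStatus();
    if (status.done) return status;
    await new Promise((r) => setTimeout(r, wait));
    wait = Math.min(wait * 2, maxMs); // double the delay up to the cap
  }
}
```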

8. Report Results

Summarize:

  • Crawl: Total pages discovered, downloaded, failed
  • Extraction: Pages processed, artifacts created
  • Export: Format, location, file size
  • Issues: Any errors or notable findings
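The summary above can be assembled mechanically from the counts the tools report; a sketch, with illustrative field names rather than a fixed plugin schema:

```javascript
// Build a plain-text results report from crawl/extract/export counts.
// Field names are illustrative, not a fixed plugin schema.
function formatReport({ crawl, extraction, export: exp, issues = [] }) {
  return [
    `Crawl: ${crawl.downloaded}/${crawl.discovered} pages downloaded, ${crawl.failed} failed`,
    `Extraction: ${extraction.pages} pages, ${extraction.artifacts} artifacts`,
    `Export: ${exp.format} -> ${exp.path} (${exp.size})`,
    issues.length ? `Issues: ${issues.join("; ")}` : "Issues: none",
  ].join("\n");
}
```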

Tips

  • For archival workflows, use warc — it's the ISO standard and preserves full HTTP headers
  • For AI consumption, use extracted — just the structured data, no raw assets
  • For sharing, use zip — compressed and portable
  • For deployment, use deploy — includes crawl-manifest.json with full metadata
  • For large sites, set maxPagesPerCrawl to avoid runaway crawls
  • Save the project after export for future reference: save_project({ name: "example.com export" })

Source

git clone https://github.com/Crawlio-app/crawlio-plugin
Skill file: skills/extract-and-export/SKILL.md

Overview

extract-and-export crawls a site, extracts structured content (clean HTML, markdown, metadata, asset manifests), and exports in seven formats. It yields artifacts ready for archival, deployment, or AI consumption.

How This Skill Works

Configure the crawl and export settings with update_settings, then start the crawl using start_crawl({ url: 'your-site-url' }). After crawling, run extract_site() to generate per-page artifacts (clean HTML, Markdown, metadata, asset manifests) and finally export_site({ format: 'folder' }) to produce the chosen output. Monitor progress with get_crawl_status and check results with get_site_tree, get_downloads, and get_extraction_status.

When to Use It

  • You want to download a site and export usable artifacts (folder, zip, PDF, etc.) for archival, deployment, or distribution.
  • You need a complete crawl-extract-export pipeline rather than a raw crawl.
  • You require a WARC or other archival format (warc, pdf) for compliance or offline access.
  • You want per-page structured data including clean HTML, Markdown, and metadata (no raw assets).
  • You need a production-ready bundle with a crawl-manifest.json for deployment.

Quick Start

  1. Configure settings with update_settings({ settings: { maxConcurrent: 4, crawlDelay: 0.5, stripTrackingParams: true }, policy: { scopeMode: "sameDomain", maxDepth: $1 or 3, respectRobotsTxt: true, includeSupportingFiles: true, downloadCrossDomainAssets: true, autoUpgradeHTTP: true } })
  2. Start the crawl with start_crawl({ url: 'your-site-url' })
  3. Run extraction and export: extract_site() and export_site({ format: 'folder' })

Best Practices

  • Define maxDepth and maxConcurrent based on the site's size to balance speed and resources.
  • Enable respectRobotsTxt and stripTrackingParams to stay compliant and lean.
  • Choose the export format to match downstream use (warc for archival, extracted for AI, deploy for deployment).
  • Review artifacts with get_site_tree and get_downloads before exporting to catch issues early.
  • Save the project after export for future reference using save_project({ name: "your-export" }).

Example Use Cases

  • Archive a small corporate site as warc to preserve HTTP headers for compliance.
  • Share a blog by exporting to zip with all assets for offline reading.
  • Produce extracted data (clean HTML, Markdown, metadata) for AI training and ingestion.
  • Create a production-ready deploy bundle including crawl-manifest.json for staging deployment.
  • Generate a quick singleHTML export to preview the site in a browser for stakeholders.
