
article-extractor

npx machina-cli add skill baslie/claude-best-practices/article-extractor --openclaw

Article Extractor (crawl4ai)

Extract main content from web articles using crawl4ai (Playwright-based crawler). Removes navigation, ads, clutter. Saves clean Markdown.

Critical: Windows Encoding

On Windows (Git Bash/MSYS2), Python's console I/O defaults to the legacy code page (e.g. cp1251 on Russian-locale systems), which raises UnicodeEncodeError on characters outside that code page. ALWAYS prefix Python commands with PYTHONIOENCODING=utf-8:

# CORRECT
PYTHONIOENCODING=utf-8 python -c "..."

# WRONG — will crash on non-ASCII
python -c "..."
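The effect of the override can be verified from Python itself by spawning a child interpreter and inspecting the encoding it reports for stdout. This is a minimal, platform-independent sketch using only the standard library:

```python
import os
import subprocess
import sys

# Launch a child Python with PYTHONIOENCODING=utf-8 and ask what encoding
# its stdout uses. Without the variable, a Windows console may report a
# legacy code page such as cp1251 instead.
env = dict(os.environ, PYTHONIOENCODING="utf-8")
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # utf-8, regardless of the console code page
```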

Installation

pip install crawl4ai && crawl4ai-setup

Check: python -c "import crawl4ai" 2>/dev/null

Complete Workflow

ARTICLE_URL="https://example.com/article"

# 1. Check if crawl4ai is installed
if ! python -c "import crawl4ai" 2>/dev/null; then
    echo "crawl4ai is not installed. Install with:"
    echo "  pip install crawl4ai && crawl4ai-setup"
    exit 1
fi

# 2. Extract article via crawl4ai Python API
# PYTHONIOENCODING=utf-8 is REQUIRED on Windows to avoid cp1251 crashes
OUTPUT=$(PYTHONIOENCODING=utf-8 python -c "
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig

async def extract():
    browser_config = BrowserConfig(headless=True)
    crawler_config = CrawlerRunConfig(
        word_count_threshold=50,
        excluded_tags=['nav', 'footer', 'header'],
        exclude_external_links=True,
        exclude_social_media_links=True,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url='$ARTICLE_URL', config=crawler_config)
        if result.success:
            title = result.metadata.get('title', 'Article') if result.metadata else 'Article'
            print(f'TITLE:{title}')
            print('---CONTENT---')
            print(result.markdown)
        else:
            print(f'ERROR:{result.error_message}')

asyncio.run(extract())
")

# 3. Check for errors
if echo "$OUTPUT" | grep -q "^ERROR:"; then
    echo "Extraction failed: $(echo "$OUTPUT" | grep "^ERROR:" | sed 's/^ERROR://')"
    exit 1
fi

# 4. Extract title and content
TITLE=$(echo "$OUTPUT" | grep "^TITLE:" | sed 's/^TITLE://')
CONTENT=$(echo "$OUTPUT" | sed -n '/^---CONTENT---$/,$ p' | tail -n +2)

# 5. Check content is not empty
if [ -z "$CONTENT" ]; then
    echo "Error: Extraction returned empty content. The page may require authentication or use unsupported rendering."
    exit 1
fi

# 6. Clean filename
# Use sed for character replacement (tr with empty replacement causes errors)
# Use cut -c 1-120 — Cyrillic chars are multibyte, shorter cut is too aggressive
FILENAME=$(echo "$TITLE" | sed 's/[\/:<>"|?*]/-/g' | sed 's/--*/-/g' | cut -c 1-120 | sed 's/^ *//;s/ *$//')
FILENAME="${FILENAME}.md"

# 7. Save content to .md file (printf avoids echo issues with -e/-n at start)
printf '%s\n' "$CONTENT" > "$FILENAME"

# 8. Show result
echo "Extracted article: $TITLE"
echo "Saved to: $FILENAME"
echo ""
echo "Preview (first 15 lines):"
head -n 15 "$FILENAME"
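For reference, the sanitization in step 6 has a straightforward Python analogue. title_to_filename below is a hypothetical helper mirroring the sed/cut pipeline, not part of the skill; note that GNU cut -c effectively counts bytes, while a Python slice counts characters, so the two length limits are only roughly equivalent:

```python
import re

def title_to_filename(title: str, max_len: int = 120) -> str:
    # Replace characters that are invalid in filenames with '-'
    # (the character class from the sed command in step 6, plus backslash)
    name = re.sub(r'[\\/:<>"|?*]', '-', title)
    # Collapse runs of dashes, cap the length, trim whitespace
    name = re.sub(r'-{2,}', '-', name)
    name = name[:max_len].strip()
    return name + '.md'

print(title_to_filename('How to: extract "articles"?'))
# How to- extract -articles-.md
```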

Error Handling

| Problem | Solution |
|---|---|
| crawl4ai not installed | pip install crawl4ai && crawl4ai-setup |
| UnicodeEncodeError (cp1251) | Prefix with PYTHONIOENCODING=utf-8; never omit on Windows |
| result.success == False | Show result.error_message to the user |
| Empty content | Page may need auth or uses unsupported rendering |
| Paywall/login required | crawl4ai cannot bypass auth walls; inform the user |
| tr error "string2 must be non-empty" | Use sed 's/[\/:<>"\|?*]/-/g' instead |
| Filename too short for Cyrillic | Use cut -c 1-120 (not 80; multibyte chars) |
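When the workflow is driven from Python rather than bash, the TITLE:/---CONTENT--- protocol emitted by the extractor (steps 3-4 above) can be parsed with a small helper. parse_extractor_output is a hypothetical name; this is a sketch of the same error/title/content split, not part of the skill:

```python
def parse_extractor_output(output: str) -> tuple[str, str]:
    """Split the extractor's output into (title, markdown content)."""
    lines = output.splitlines()
    # Step 3: surface extraction errors
    for line in lines:
        if line.startswith("ERROR:"):
            raise RuntimeError(line[len("ERROR:"):])
    # Step 4: pull the title, then everything after the separator
    title = next((l[len("TITLE:"):] for l in lines if l.startswith("TITLE:")), "Article")
    try:
        sep = lines.index("---CONTENT---")
    except ValueError:
        return title, ""
    return title, "\n".join(lines[sep + 1:])

title, content = parse_extractor_output("TITLE:Hello\n---CONTENT---\n# Hi\nBody")
print(title)  # Hello
```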

After Extraction

Display: article title, filename, preview (first 10-15 lines).

Source

git clone https://github.com/baslie/claude-best-practices

View on GitHub: https://github.com/baslie/claude-best-practices/blob/main/skills/article-extractor/SKILL.md

Overview

Article Extractor uses crawl4ai, a Playwright-based crawler optimized for LLMs, to pull the main content from blog posts, articles, and tutorials. It removes navigation, ads, and clutter, delivering clean Markdown. It also handles JavaScript-rendered pages and SPAs.

How This Skill Works

Given a URL, the tool runs crawl4ai via a Python API to fetch the article content. The crawler is configured to strip nav, header, and footer elements and exclude external links, returning a Markdown block with the title and content. The result is saved as a .md file and the filename and content are surfaced to the user.

When to Use It

  • User provides a URL and wants the article text only
  • User asks to download the article or save it locally
  • User wants to extract content from a URL
  • User wants to save this blog post as Markdown
  • User uses multilingual commands such as сохрани статью ("save the article"), скачай статью ("download the article"), извлеки текст ("extract the text")

Quick Start

  1. Install crawl4ai and set up the environment
  2. Run the Python snippet to extract the article from the URL
  3. Save and review the generated Markdown file

Best Practices

  • Validate the URL before processing
  • Prefix Python commands with PYTHONIOENCODING=utf-8 on Windows to avoid encoding errors
  • Configure crawl4ai to remove nav, header, footer and external links
  • Verify the extracted Markdown includes a title and content
  • Sanitize the article title for safe filename and limit length
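The first practice above can be implemented with the standard library alone. is_valid_http_url is a hypothetical helper name; this is a minimal sanity check, not full URL validation:

```python
from urllib.parse import urlparse

def is_valid_http_url(url: str) -> bool:
    # Accept only http(s) URLs that actually contain a host
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_valid_http_url("https://example.com/article"))  # True
print(is_valid_http_url("notaurl"))  # False
```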

Example Use Cases

  • Extract and save a tech blog post as Markdown for a knowledge base
  • Download a recipe article and store as clean Markdown with no ads
  • Pull a tutorial page content for offline study
  • Save a news article for archival in Markdown
  • Convert a developer doc page into Markdown for internal docs
