article-extractor
Article Extractor (crawl4ai)
Extract main content from web articles using crawl4ai (Playwright-based crawler). Removes navigation, ads, clutter. Saves clean Markdown.
Critical: Windows Encoding
On Windows (Git Bash/MSYS2), Python defaults to the legacy system code page (e.g. cp1251 on Russian-locale systems), which raises UnicodeEncodeError on characters it cannot encode. ALWAYS prefix Python commands with PYTHONIOENCODING=utf-8:
# CORRECT
PYTHONIOENCODING=utf-8 python -c "..."
# WRONG — will crash on non-ASCII
python -c "..."
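To confirm the override takes effect, you can inspect the encoding Python will use for stdout. This is a quick sanity check, not part of the skill itself:

```shell
# Print the encoding Python uses for stdout; with the override it should be utf-8
PYTHONIOENCODING=utf-8 python -c "import sys; print(sys.stdout.encoding)"
```

Without the prefix, an affected Windows setup prints cp1251 (or another legacy code page) here instead.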
Installation
pip install crawl4ai && crawl4ai-setup
Check: python -c "import crawl4ai" 2>/dev/null
Complete Workflow
ARTICLE_URL="https://example.com/article"
# 1. Check if crawl4ai is installed
if ! python -c "import crawl4ai" 2>/dev/null; then
  echo "crawl4ai is not installed. Install with:"
  echo "  pip install crawl4ai && crawl4ai-setup"
  exit 1
fi
# 2. Extract article via crawl4ai Python API
# PYTHONIOENCODING=utf-8 is REQUIRED on Windows to avoid cp1251 crashes
OUTPUT=$(PYTHONIOENCODING=utf-8 python -c "
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig

async def extract():
    browser_config = BrowserConfig(headless=True)
    crawler_config = CrawlerRunConfig(
        word_count_threshold=50,
        excluded_tags=['nav', 'footer', 'header'],
        exclude_external_links=True,
        exclude_social_media_links=True,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url='$ARTICLE_URL', config=crawler_config)
        if result.success:
            title = result.metadata.get('title', 'Article') if result.metadata else 'Article'
            print(f'TITLE:{title}')
            print('---CONTENT---')
            print(result.markdown)
        else:
            print(f'ERROR:{result.error_message}')

asyncio.run(extract())
")
# 3. Check for errors
if echo "$OUTPUT" | grep -q "^ERROR:"; then
  echo "Extraction failed: $(echo "$OUTPUT" | grep "^ERROR:" | sed 's/^ERROR://')"
  exit 1
fi
# 4. Extract title and content
TITLE=$(echo "$OUTPUT" | grep "^TITLE:" | sed 's/^TITLE://')
CONTENT=$(echo "$OUTPUT" | sed -n '/^---CONTENT---$/,$ p' | tail -n +2)
# 5. Check content is not empty
if [ -z "$CONTENT" ]; then
  echo "Error: Extraction returned empty content. The page may require authentication or use unsupported rendering."
  exit 1
fi
# 6. Clean filename
# Use sed for character replacement (tr with empty replacement causes errors)
# Use cut -c 1-120 — Cyrillic chars are multibyte, shorter cut is too aggressive
FILENAME=$(echo "$TITLE" | sed 's/[\/:<>"|?*]/-/g' | sed 's/--*/-/g' | cut -c 1-120 | sed 's/^ *//;s/ *$//')
FILENAME="${FILENAME}.md"
# 7. Save content to .md file (printf avoids echo issues with -e/-n at start)
printf '%s\n' "$CONTENT" > "$FILENAME"
# 8. Show result
echo "Extracted article: $TITLE"
echo "Saved to: $FILENAME"
echo ""
echo "Preview (first 15 lines):"
head -n 15 "$FILENAME"
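A variant worth considering (a sketch, not part of the original skill): interpolating $ARTICLE_URL directly into the Python source in step 2 breaks if the URL contains a single quote. Passing the URL through the environment sidesteps quoting entirely:

```shell
# Pass the URL via the environment instead of interpolating it into Python source
ARTICLE_URL="https://example.com/article" PYTHONIOENCODING=utf-8 python - <<'PY'
import os

# Read the URL from the environment; no shell quoting pitfalls
url = os.environ["ARTICLE_URL"]
print(url)
PY
```

The quoted heredoc ('PY') prevents the shell from expanding anything inside the Python code, so the script text is passed to Python verbatim.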
Error Handling
| Problem | Solution |
|---|---|
| crawl4ai not installed | pip install crawl4ai && crawl4ai-setup |
| UnicodeEncodeError (cp1251) | Prefix with PYTHONIOENCODING=utf-8 — never omit on Windows |
| result.success == False | Show result.error_message to user |
| Empty content | Page may need auth or uses unsupported rendering |
| Paywall/login required | crawl4ai cannot bypass auth walls — inform user |
| tr error "string2 must be non-empty" | Use the sed character-class substitution from step 6 instead of tr |
| Filename too short for Cyrillic | Use cut -c 1-120 (not 80 — multibyte chars) |
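The tr pitfall in the table is easy to reproduce: tr requires a non-empty second set, while sed substitutes a character class with any replacement, including a single dash as in step 6. A minimal illustration:

```shell
# sed replaces every character in the bracket class; tr would reject an empty SET2
echo 'A/B:C?' | sed 's/[\/:?]/-/g'
# prints: A-B-C-
```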
After Extraction
Display: article title, filename, preview (first 15 lines, as in step 8).
Source
https://github.com/baslie/claude-best-practices/blob/main/skills/article-extractor/SKILL.md
Overview
Article Extractor uses crawl4ai, a Playwright-based crawler optimized for LLMs, to pull the main content from blog posts, articles, and tutorials. It removes navigation, ads, and clutter, delivering clean Markdown. It also handles JavaScript-rendered pages and SPAs.
How This Skill Works
Given a URL, the tool runs crawl4ai via a Python API to fetch the article content. The crawler is configured to strip nav, header, and footer elements and exclude external links, returning a Markdown block with the title and content. The result is saved as a .md file and the filename and content are surfaced to the user.
When to Use It
- User provides a URL and wants the article text only
- User asks to download the article or save it locally
- User wants to extract content from a URL
- User wants to save this blog post as Markdown
- User uses multilingual commands such as сохрани статью (save the article), скачай статью (download the article), извлеки текст (extract the text)
Quick Start
- Step 1: Install crawl4ai and set up the environment
- Step 2: Run the Python snippet to extract the article from the URL
- Step 3: Save and review the generated Markdown file
Best Practices
- Validate the URL before processing
- Prefix Python commands with PYTHONIOENCODING=utf-8 on Windows to avoid encoding errors
- Configure crawl4ai to remove nav, header, footer and external links
- Verify the extracted Markdown includes a title and content
- Sanitize the article title for safe filename and limit length
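The first bullet can be handled with a plain shell check before invoking the crawler. This is a minimal sketch that only verifies the scheme, not full URL syntax:

```shell
# Reject anything that is not an http(s) URL before starting the crawler
ARTICLE_URL="https://example.com/article"
case "$ARTICLE_URL" in
  http://*|https://*) echo "URL looks valid" ;;
  *) echo "Error: not an http(s) URL: $ARTICLE_URL" >&2; exit 1 ;;
esac
```

A case pattern match is enough here because crawl4ai itself will surface deeper errors (DNS failure, 404) through result.error_message.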
Example Use Cases
- Extract and save a tech blog post as Markdown for a knowledge base
- Download a recipe article and store as clean Markdown with no ads
- Pull a tutorial page content for offline study
- Save a news article for archival in Markdown
- Convert a developer doc page into Markdown for internal docs