pdf-extract
npx machina-cli add skill maaarcooo/claude-skills/pdf-extract --openclaw
PDF Content Extraction Skill
Extract PDF content to clean, organized markdown.
Workflow
1. Extract: Run script to get raw content + metadata
2. Analyse: Review for patterns and issues
3. Clean: Manually remove noise (footers, watermarks, branding)
4. Organise: Restructure fragmented content
5. Output: Deliver clean markdown
Note: Only Step 1 uses a script. Steps 2–5 are performed manually by Claude reading and rewriting content. Do not write cleanup scripts.
Step 1: Extract
python /mnt/skills/user/pdf-extract/scripts/extract_pdf.py \
/mnt/user-data/uploads/{filename}.pdf \
/home/claude/extracted/
Options:
| Option | Description |
|---|---|
| --pages 1-10 | Extract specific page range |
| --method pymupdf4llm | Force primary extractor (better formatting) |
| --method pymupdf | Force fallback (more reliable for scanned PDFs) |
| --min-image-size 100 | Skip images smaller than 100px (filters icons) |
Output:
/home/claude/extracted/
├── {filename}.md # Raw markdown with YAML frontmatter
├── metadata.json # Structured metadata
└── images/ # Extracted images (if any)
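Before moving on to Step 2, it can help to confirm that the extraction produced the files listed above. The following is a minimal sketch, assuming the output layout shown in the tree; `check_extraction_output` is a hypothetical helper and is not part of the skill's script.

```python
from pathlib import Path


def check_extraction_output(out_dir: str, stem: str) -> dict:
    """Report which of the expected extraction artifacts exist.

    Hypothetical helper: the layout is assumed from the tree above.
    """
    root = Path(out_dir)
    images_dir = root / "images"
    # images/ is optional: it only appears when the PDF contained images
    if images_dir.is_dir():
        image_count = sum(1 for p in images_dir.iterdir() if p.is_file())
    else:
        image_count = 0
    return {
        "markdown": (root / f"{stem}.md").is_file(),
        "metadata": (root / "metadata.json").is_file(),
        "image_count": image_count,
    }
```

A missing markdown or metadata file usually means Step 1 failed and should be rerun before any analysis.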
Step 2: Analyse
Read the extracted markdown:
cat /home/claude/extracted/{filename}.md
Check YAML frontmatter for:
- extraction_method: Which extractor was used
- total_pages: Document length
- has_outline: Bookmarks exist (helps with structure)
- total_images: Number of images
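If reading the frontmatter by eye is inconvenient, the fields can be pulled out programmatically. This is a rough sketch that assumes the frontmatter is a flat block of `key: value` lines between two `---` delimiters; the exact value formats written by the extraction script are not documented here, so values are kept as plain strings. `read_frontmatter` is a hypothetical name.

```python
def read_frontmatter(markdown_text: str) -> dict:
    """Parse a flat `key: value` frontmatter block into a dict of strings.

    Assumption: no nested YAML; anything more complex needs a real YAML parser.
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as having no frontmatter


# Illustrative input only; the values are made up for the example.
sample = (
    "---\n"
    "extraction_method: pymupdf4llm\n"
    "total_pages: 12\n"
    "has_outline: true\n"
    "total_images: 3\n"
    "---\n"
    "# Title\n"
)
fields = read_frontmatter(sample)
```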
Identify issues requiring cleanup:
- Repeated footers/headers on every page
- Watermarks, branding, page numbers
- Fragmented sentences across line breaks
- Malformed tables
- Image markers needing repositioning
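To surface candidates for repeated footers and headers during analysis, one option is to count how often each non-empty line recurs. The sketch below is read-only: it reports candidates and changes nothing, so the removal itself still happens manually in Step 3. The function name and the default threshold of 3 are assumptions.

```python
from collections import Counter


def repeated_lines(markdown_text: str, min_count: int = 3) -> list:
    """Return (line, count) pairs for lines that recur at least min_count times.

    Hypothetical analysis aid: it only reports, it never edits the text.
    """
    counts = Counter(
        line.strip() for line in markdown_text.splitlines() if line.strip()
    )
    return [(line, n) for line, n in counts.most_common() if n >= min_count]
```

A high count does not prove a line is noise; a legitimately repeated table header or list item shows up the same way, which is why the final judgment stays with the reader.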
Step 3: Clean
IMPORTANT: Manual cleanup only.
- Do NOT write Python scripts to clean the content
- Do NOT use sed, awk, or regex replacement commands
- Do NOT copy-paste the raw content and run substitutions
Instead: Read the extracted content, understand it, then write a clean version from scratch, omitting the noise as you write.
Why manual? Each PDF has unique patterns. Claude makes better contextual decisions than automated rules — knowing what's noise vs. legitimate content, handling edge cases, and preserving meaning.
Process:
- Read through the extracted markdown completely
- Identify repeated noise (footers, headers, branding, page numbers)
- Note the actual content structure (sections, flow, key information)
- Write the clean output directly, skipping noise as you go
Load references as needed for pattern recognition:
Repeated elements & source-specific patterns: See cleanup-patterns.md
- Use when: footers, headers, SME/PMT branding detected
Text fragmentation: See sentence-reflow.md
- Use when: sentences split across lines or pages
Table issues: See table-formatting.md
- Use when: tables have missing delimiters, broken structure
Image handling: See image-handling.md
- Use when: document contains images to process
Step 4: Organise
While writing the clean output, apply these formatting principles:
Heading Hierarchy
- Use # → ## → ### consistently
- Don't skip levels
- Remove redundant numbering if using markdown headers
Paragraph Flow
- Single blank line between paragraphs
- Remove orphan lines (single words alone)
- Merge related short paragraphs
Image Placement
Convert markers to proper markdown:
<!-- Before -->
<!-- IMAGE: images/page003_img001.png (450x280px) -->
<!-- After -->
![Description of the image](images/page003_img001.png)
View each image with view tool to write accurate alt text.
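Once the markers are converted, a quick check can confirm that every image reference in the clean markdown points at a file that exists. This is a sketch under the assumption that all image targets are relative paths such as `images/page003_img001.png`; `missing_images` is a hypothetical helper, and a URL target would be reported as missing.

```python
import re
from pathlib import Path

# Matches ![alt](target) and captures the target; titles after the path are not handled.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def missing_images(markdown_text: str, base_dir: str) -> list:
    """List image targets that do not resolve to a file under base_dir."""
    base = Path(base_dir)
    return [
        target
        for target in IMAGE_REF.findall(markdown_text)
        if not (base / target).is_file()
    ]
```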
Step 5: Output
Write Clean File
After reading and mentally processing the extracted content, write the clean markdown directly to a file:
# Write clean content to file (Claude creates this content)
cat > /mnt/user-data/outputs/{filename}_clean.md << 'EOF'
# Document Title
[Clean content goes here - written by Claude, not copied]
EOF
Or use the create_file tool to write the clean content directly.
Copy Images (if applicable)
mkdir -p /mnt/user-data/outputs/images/
cp -r /home/claude/extracted/images/* /mnt/user-data/outputs/images/
Quality Check
- No repeated footers/headers
- No standalone page numbers
- No watermarks or branding
- Sentences properly rejoined
- Tables intact and readable
- Images converted to markdown syntax
- Heading hierarchy logical
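The heading-hierarchy item in this checklist can be verified mechanically. The following is a minimal, read-only sketch that flags headings jumping more than one level deeper than the previous heading; it assumes ATX-style (`#`) headings, skips fenced code blocks, and never modifies the text. `skipped_heading_levels` is a hypothetical name.

```python
import re

FENCE = "`" * 3  # code-fence marker, built this way to keep the example easy to embed
HEADING = re.compile(r"^(#{1,6})\s+\S")


def skipped_heading_levels(markdown_text: str) -> list:
    """Return (line_number, heading) pairs where a heading skips a level."""
    problems = []
    previous_level = 0
    in_fence = False
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore '#' comments inside code blocks
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if level > previous_level + 1:
            problems.append((number, line.strip()))
        previous_level = level
    return problems
```

Any reported heading is a prompt to re-read that section, not an instruction to renumber it automatically.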
Summary to User
Include:
- Pages extracted
- What was cleaned (types of noise removed)
- Images included (remind about images/ folder requirement)
- Any limitations noted
Error Handling
| Error | Cause | Solution |
|---|---|---|
| "File not found" | Wrong path | Check /mnt/user-data/uploads/ |
| "Invalid PDF header" | Not a PDF | Inform user file is invalid |
| "Extraction failed" | Protected/corrupted | Try --method pymupdf |
| Empty output | Scanned PDF | Inform user, suggest OCR |
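For the "Invalid PDF header" row, the upload can be checked before running the extractor. PDF files begin with the marker `%PDF-`, so a small sketch like the one below can distinguish a real PDF from a renamed file. Whether the extraction script performs the same check is an assumption, and `has_pdf_header` is a hypothetical helper. Some readers tolerate a few junk bytes before the marker; this sketch does not.

```python
def has_pdf_header(path: str) -> bool:
    """Return True if the file starts with the PDF magic marker.

    Strict check: the marker must be the first five bytes of the file.
    """
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"
```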
Special Cases
Scanned/Image PDFs
If extraction_method shows pymupdf (fallback) with minimal text:
- PDF is likely scanned/image-based
- Inform user OCR tools may be needed
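"Minimal text" can be made concrete as an average character count per page. The sketch below assumes `total_pages` comes from the frontmatter and uses a threshold of 50 visible characters per page; that number is an arbitrary illustration, not a value taken from the skill, and `looks_scanned` is a hypothetical name.

```python
def looks_scanned(body_text: str, total_pages: int, min_chars_per_page: int = 50) -> bool:
    """Heuristic: very little visible text per page suggests an image-based PDF.

    The default threshold is an assumption and should be tuned per document type.
    """
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    visible = sum(1 for ch in body_text if not ch.isspace())
    return visible / total_pages < min_chars_per_page
```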
Large Documents (50+ pages)
Consider extracting in ranges:
python extract_pdf.py doc.pdf ./out1/ --pages 1-25
python extract_pdf.py doc.pdf ./out2/ --pages 26-50
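The ranges passed to `--pages` can be computed rather than typed by hand. This is a small sketch, assuming `--pages` takes inclusive `start-end` values as in the commands above; `page_ranges` and the default chunk size of 25 are illustrative choices.

```python
def page_ranges(total_pages: int, chunk_size: int = 25) -> list:
    """Split 1..total_pages into inclusive 'start-end' strings for --pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be at least 1")
    return [
        f"{start}-{min(start + chunk_size - 1, total_pages)}"
        for start in range(1, total_pages + 1, chunk_size)
    ]
```

For a 50-page document with the default chunk size this yields the two ranges used above, 1-25 and 26-50.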
Multi-Column Layouts
Verify reading order makes sense. pymupdf4llm handles columns reasonably but may interleave incorrectly.
Output Format
Final markdown structure:
# {Document Title}
## {First Section}
{Clean content...}
## {Second Section}
{Clean content...}
---
*Source: {filename}.pdf | Extracted: {date}*
Source
git clone https://github.com/maaarcooo/claude-skills/blob/main/archive/pdf-extract/SKILL.md
Overview
pdf-extract converts PDFs into clean, readable Markdown by extracting text, images, and metadata while removing noise such as footers, watermarks, and page numbers. It guides a manual cleanup process to restructure fragmented content into a coherent document. The workflow covers extraction, analysis, cleaning, organization, and delivering final Markdown.
How This Skill Works
A script extracts raw content and metadata from the PDF. Claude then reviews the YAML frontmatter, identifies noise patterns, and manually writes clean Markdown from scratch, reflowing text and reorganizing sections. Cleanup is strictly manual; no cleanup scripts are written or executed.
When to Use It
- User uploads a PDF and wants a clean, readable Markdown version.
- Need to remove repeated footers, watermarks, branding, or page numbers.
- Working with scanned PDFs, where the pymupdf fallback is more reliable; if little or no text comes out, OCR may be needed.
- Content is fragmented across pages and needs restructuring into a coherent flow.
- Preparing content for publishing or inclusion in a knowledge base as Markdown.
Quick Start
- Step 1: Run the extraction script to generate raw Markdown, metadata.json, and images.
- Step 2: Review the extracted YAML frontmatter and identify noise patterns.
- Step 3: Manually craft the clean Markdown from scratch, preserving structure.
Best Practices
- Run Step 1 to extract raw content and metadata, then review YAML frontmatter for extraction_method and total_pages.
- Identify noise patterns like footers, headers, watermarks, and page numbers before rewriting.
- Maintain a consistent heading hierarchy and logical paragraph flow in the final Markdown.
- Preserve images with proper placement and captions, converting markers to clean Markdown references.
- Rely on the documented manual cleanup approach and avoid automated scripting for cleanup.
Example Use Cases
- Convert a product manual into Markdown for a docs site, stripping branding noise.
- Extract a research paper and publish a clean readable summary with figures.
- Prepare a white paper for a knowledge base by removing watermarks and page numbers.
- Turn a scanned brochure into structured Markdown with images inline.
- Clean up an academic PDF and restructure sections for easy navigation.