
pdf-extract

npx machina-cli add skill maaarcooo/claude-skills/pdf-extract --openclaw

PDF Content Extraction Skill

Extract PDF content to clean, organized markdown.

Workflow

  1. Extract — Run script to get raw content + metadata
  2. Analyse — Review for patterns and issues
  3. Clean — Manually remove noise (footers, watermarks, branding)
  4. Organise — Restructure fragmented content
  5. Output — Deliver clean markdown

Note: Only Step 1 uses a script. Steps 2–5 are performed manually by Claude reading and rewriting content. Do not write cleanup scripts.

Step 1: Extract

python /mnt/skills/user/pdf-extract/scripts/extract_pdf.py \
    /mnt/user-data/uploads/{filename}.pdf \
    /home/claude/extracted/

Options:

| Option | Description |
|--------|-------------|
| --pages 1-10 | Extract specific page range |
| --method pymupdf4llm | Force primary extractor (better formatting) |
| --method pymupdf | Force fallback (more reliable for scanned PDFs) |
| --min-image-size 100 | Skip images smaller than 100px (filters icons) |
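
When the extractor is launched from Python rather than a shell, the options listed above can be assembled into an argument list. This is a minimal sketch: `build_extract_command` is a hypothetical helper, not part of the skill, and the upload filename is a placeholder. It only builds the command and does not run it.

```python
from typing import List, Optional

SCRIPT = "/mnt/skills/user/pdf-extract/scripts/extract_pdf.py"


def build_extract_command(
    pdf_path: str,
    out_dir: str,
    pages: Optional[str] = None,
    method: Optional[str] = None,
    min_image_size: Optional[int] = None,
) -> List[str]:
    """Return the argv list for extract_pdf.py (hypothetical helper)."""
    cmd = ["python", SCRIPT, pdf_path, out_dir]
    if pages is not None:
        cmd += ["--pages", pages]
    if method is not None:
        # Only the two extractor names documented in the options are accepted.
        if method not in ("pymupdf4llm", "pymupdf"):
            raise ValueError(f"unknown method: {method}")
        cmd += ["--method", method]
    if min_image_size is not None:
        cmd += ["--min-image-size", str(min_image_size)]
    return cmd


# Placeholder filename; substitute the real upload.
cmd = build_extract_command(
    "/mnt/user-data/uploads/report.pdf",
    "/home/claude/extracted/",
    pages="1-10",
    method="pymupdf",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. A list avoids shell quoting problems with filenames that contain spaces.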

Output:

/home/claude/extracted/
├── {filename}.md      # Raw markdown with YAML frontmatter
├── metadata.json      # Structured metadata
└── images/            # Extracted images (if any)
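
Before moving to Step 2, it can help to confirm that the expected files were written. The sketch below assumes the layout shown above; `missing_outputs` is a hypothetical helper. It treats `images/` as optional, since that folder only appears when the PDF contains images.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_outputs(out_dir: str, stem: str) -> List[str]:
    """List expected extraction outputs that are absent (images/ is optional)."""
    root = Path(out_dir)
    expected = [root / f"{stem}.md", root / "metadata.json"]
    return [str(p.relative_to(root)) for p in expected if not p.is_file()]


# Demo against a throwaway directory rather than the real output path.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.md").write_text("---\ntotal_pages: 3\n---\n")
    result = missing_outputs(tmp, "report")
    print(result)
```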

Step 2: Analyse

Read the extracted markdown:

cat /home/claude/extracted/{filename}.md

Check YAML frontmatter for:

  • extraction_method — Which extractor was used
  • total_pages — Document length
  • has_outline — Bookmarks exist (helps with structure)
  • total_images — Number of images
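
The frontmatter fields can also be read programmatically as part of the analysis. The source does not show the exact frontmatter layout, so this sketch assumes flat `key: value` lines between two `---` delimiters. The sample values are invented for illustration, and `read_frontmatter` is a hypothetical helper.

```python
from typing import Dict


def read_frontmatter(markdown: str) -> Dict[str, str]:
    """Parse flat `key: value` frontmatter between the leading '---' lines.

    Assumes simple scalar fields; nested YAML would need a real YAML parser.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing '---' found, so treat as no frontmatter


# Invented sample; real files come from the extraction step.
sample = """---
extraction_method: pymupdf4llm
total_pages: 12
has_outline: true
total_images: 3
---
# Raw content
"""
meta = read_frontmatter(sample)
for key in ("extraction_method", "total_pages", "has_outline", "total_images"):
    print(f"{key}: {meta.get(key, '(missing)')}")
```

All values are returned as strings, so `total_pages` needs an `int()` conversion before any arithmetic.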

Identify issues requiring cleanup:

  • Repeated footers/headers on every page
  • Watermarks, branding, page numbers
  • Fragmented sentences across line breaks
  • Malformed tables
  • Image markers needing repositioning

Step 3: Clean

IMPORTANT: Manual cleanup only.

  • Do NOT write Python scripts to clean the content
  • Do NOT use sed, awk, or regex replacement commands
  • Do NOT copy-paste the raw content and run substitutions

Instead: Read the extracted content, understand it, then write a clean version from scratch, omitting the noise as you write.

Why manual? Each PDF has unique patterns. Claude makes better contextual decisions than automated rules — knowing what's noise vs. legitimate content, handling edge cases, and preserving meaning.

Process:

  1. Read through the extracted markdown completely
  2. Identify repeated noise (footers, headers, branding, page numbers)
  3. Note the actual content structure (sections, flow, key information)
  4. Write the clean output directly, skipping noise as you go

Load references as needed for pattern recognition:

Repeated elements & source-specific patterns: See cleanup-patterns.md

  • Use when: footers, headers, SME/PMT branding detected

Text fragmentation: See sentence-reflow.md

  • Use when: sentences split across lines or pages

Table issues: See table-formatting.md

  • Use when: tables have missing delimiters, broken structure

Image handling: See image-handling.md

  • Use when: document contains images to process

Step 4: Organise

While writing the clean output, apply these formatting principles:

Heading Hierarchy

  • Use # → ## → ### consistently
  • Don't skip levels
  • Remove redundant numbering if using markdown headers

Paragraph Flow

  • Single blank line between paragraphs
  • Remove orphan lines (single words alone)
  • Merge related short paragraphs

Image Placement

Convert markers to proper markdown:

<!-- Before -->
<!-- IMAGE: images/page003_img001.png (450x280px) -->

<!-- After -->
![Figure 1: Description](./images/page003_img001.png)

View each image with the view tool to write accurate alt text.
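
To make sure no image is overlooked, the markers can be listed before viewing each file. This sketch infers the marker pattern from the single example above, so other marker variants may not match. `list_image_markers` is a hypothetical helper that only reports markers. Writing the alt text and placing each image remain manual.

```python
import re
from typing import List, Tuple

# Pattern inferred from the one marker example shown in this skill.
MARKER = re.compile(r"<!--\s*IMAGE:\s*(\S+)\s*\((\d+)x(\d+)px\)\s*-->")


def list_image_markers(markdown: str) -> List[Tuple[str, int, int]]:
    """Return (path, width, height) for each image marker, in document order."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in MARKER.finditer(markdown)
    ]


raw = """Some text.
<!-- IMAGE: images/page003_img001.png (450x280px) -->
More text.
"""
markers = list_image_markers(raw)
print(markers)
```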

Step 5: Output

Write Clean File

After reading and mentally processing the extracted content, write the clean markdown directly to a file:

# Write clean content to file (Claude creates this content)
cat > /mnt/user-data/outputs/{filename}_clean.md << 'EOF'
# Document Title

[Clean content goes here - written by Claude, not copied]

EOF

Or use the create_file tool to write the clean content directly.

Copy Images (if applicable)

mkdir -p /mnt/user-data/outputs/images/
cp -r /home/claude/extracted/images/* /mnt/user-data/outputs/images/

Quality Check

  • No repeated footers/headers
  • No standalone page numbers
  • No watermarks or branding
  • Sentences properly rejoined
  • Tables intact and readable
  • Images converted to markdown syntax
  • Heading hierarchy logical
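
A few items on this checklist are mechanical enough to verify with a read-only check after the clean file is written. This sketch reports problems and never modifies the content, so it does not replace the manual cleanup. `quality_issues` is a hypothetical helper, and its heuristics are assumptions: for example, any line holding only a one-to-four digit number is treated as a possible page number. Sentence rejoining and table readability still need a human read.

```python
import re
from typing import List

FENCE = "`" * 3  # code-fence delimiter, built this way to keep the example readable


def quality_issues(markdown: str) -> List[str]:
    """Flag a few mechanical problems; reports only, never edits."""
    issues: List[str] = []
    previous_level = 0
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # ignore anything inside fenced code
        if re.fullmatch(r"\d{1,4}", stripped):
            issues.append(f"line {number}: standalone page number?")
        if "<!-- IMAGE:" in line:
            issues.append(f"line {number}: unconverted image marker")
        heading = re.match(r"(#{1,6})\s", line)
        if heading:
            level = len(heading.group(1))
            if previous_level and level > previous_level + 1:
                issues.append(f"line {number}: heading level skipped")
            previous_level = level
    return issues


sample = (
    "# Title\n\n### Deep section\n\n12\n\nText.\n"
    "<!-- IMAGE: images/a.png (10x10px) -->\n"
)
issues = quality_issues(sample)
for issue in issues:
    print(issue)
```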

Summary to User

Include:

  • Pages extracted
  • What was cleaned (types of noise removed)
  • Images included (remind about images/ folder requirement)
  • Any limitations noted

Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| "File not found" | Wrong path | Check /mnt/user-data/uploads/ |
| "Invalid PDF header" | Not a PDF | Inform user file is invalid |
| "Extraction failed" | Protected/corrupted | Try --method pymupdf |
| Empty output | Scanned PDF | Inform user, suggest OCR |
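
The "Invalid PDF header" case can be anticipated by checking the file signature before running the extractor. PDF files begin with the marker `%PDF-`. How the script itself performs this check is not shown in the source, so the sketch below is an independent pre-check and `looks_like_pdf` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """True if the '%PDF-' signature appears in the first 1024 bytes."""
    with open(path, "rb") as handle:
        return b"%PDF-" in handle.read(1024)


# Demo with two throwaway files: one with a PDF signature, one without.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.pdf"
    bad = Path(tmp) / "bad.pdf"
    good.write_bytes(b"%PDF-1.7\n%...")
    bad.write_bytes(b"PK\x03\x04 not a pdf")
    results = (looks_like_pdf(str(good)), looks_like_pdf(str(bad)))
    print(results)
```

Scanning the first 1024 bytes rather than only the first five tolerates files with a few leading junk bytes, which some readers accept.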

Special Cases

Scanned/Image PDFs

If extraction_method shows pymupdf (fallback) with minimal text:

  • PDF is likely scanned/image-based
  • Inform user OCR tools may be needed
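
"Minimal text" can be made concrete by comparing the amount of extracted text with the page count. This is a rough heuristic sketch: the 50-characters-per-page threshold is an illustrative assumption, not a value from the skill, and `likely_scanned` is a hypothetical helper.

```python
def likely_scanned(body_text: str, total_pages: int, min_chars_per_page: int = 50) -> bool:
    """Heuristic: very little text per page suggests an image-only PDF.

    The default threshold is an illustrative assumption, not a value from the skill.
    """
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    visible = "".join(body_text.split())  # drop all whitespace before counting
    return len(visible) / total_pages < min_chars_per_page


print(likely_scanned("Page 1\n\nPage 2", total_pages=2))
print(likely_scanned("x" * 5000, total_pages=10))
```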

Large Documents (50+ pages)

Consider extracting in ranges:

python extract_pdf.py doc.pdf ./out1/ --pages 1-25
python extract_pdf.py doc.pdf ./out2/ --pages 26-50
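
For documents of arbitrary length, the `--pages` arguments can be generated rather than typed by hand. This is a minimal sketch: `page_ranges` is a hypothetical helper, and the chunk size of 25 simply mirrors the example above. It assumes `--pages` takes inclusive `start-end` ranges, as the example suggests.

```python
from typing import List


def page_ranges(total_pages: int, chunk_size: int = 25) -> List[str]:
    """Split 1..total_pages into inclusive 'start-end' strings for --pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        f"{start}-{min(start + chunk_size - 1, total_pages)}"
        for start in range(1, total_pages + 1, chunk_size)
    ]


ranges = page_ranges(60)
print(ranges)
```

Each range would go to a separate output directory, as in the commands above, so that later runs do not overwrite earlier ones.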

Multi-Column Layouts

Verify reading order makes sense. pymupdf4llm handles columns reasonably but may interleave incorrectly.

Output Format

Final markdown structure:

# {Document Title}

## {First Section}

{Clean content...}

## {Second Section}

{Clean content...}

---

*Source: {filename}.pdf | Extracted: {date}*

Source

https://github.com/maaarcooo/claude-skills/blob/main/archive/pdf-extract/SKILL.md

Overview

pdf-extract converts PDFs into clean, readable Markdown by extracting text, images, and metadata while removing noise such as footers, watermarks, and page numbers. It guides a manual cleanup process to restructure fragmented content into a coherent document. The workflow covers extraction, analysis, cleaning, organization, and delivering final Markdown.

How This Skill Works

A script extracts raw content and metadata from the PDF. Claude then reviews the frontmatter, identifies noise patterns, and manually rewrites clean Markdown from scratch, reflowing text and reorganizing sections. Cleanup is strictly manual; no cleanup scripts are written or executed.

When to Use It

  • User uploads a PDF and wants a clean, readable Markdown version.
  • Need to remove repeated footers, watermarks, branding, or page numbers.
  • Working with scanned PDFs, where the pymupdf fallback may extract more reliably (OCR may still be needed if the output is empty).
  • Content is fragmented across pages and needs restructuring into a coherent flow.
  • Preparing content for publishing or inclusion in a knowledge base as Markdown.

Quick Start

  1. Run the extraction script to generate raw Markdown, metadata.json, and images.
  2. Review the extracted YAML frontmatter and identify noise patterns.
  3. Manually craft the clean Markdown from scratch, preserving structure.

Best Practices

  • Run Step 1 to extract raw content and metadata, then review YAML frontmatter for extraction_method and total_pages.
  • Identify noise patterns like footers, headers, watermarks, and page numbers before rewriting.
  • Maintain a consistent heading hierarchy and logical paragraph flow in the final Markdown.
  • Preserve images with proper placement and captions, converting markers to clean Markdown references.
  • Rely on the documented manual cleanup approach and avoid automated scripting for cleanup.

Example Use Cases

  • Convert a product manual into Markdown for a docs site, stripping branding noise.
  • Extract a research paper and publish a clean readable summary with figures.
  • Prepare a white paper for a knowledge base by removing watermarks and page numbers.
  • Turn a scanned brochure into structured Markdown with images inline.
  • Clean up an academic PDF and restructure sections for easy navigation.
