
extracting-pdfs

npx machina-cli add skill maaarcooo/claude-skills/extracting-pdfs --openclaw

PDF Content Extraction Skill

Extract PDF content to clean, organized markdown.

Workflow

  1. Extract — Run script to get raw content + metadata
  2. Analyse — Review for patterns and issues
  3. Clean — Manually rewrite, omitting noise
  4. Organise — Apply formatting principles
  5. Output — Deliver clean markdown

Step 1: Extract

python /mnt/skills/user/extracting-pdfs/scripts/extract_pdf.py \
    /mnt/user-data/uploads/{filename}.pdf \
    /home/claude/extracted/

For scanned PDFs or problematic extractions, use --method pymupdf. For page ranges, use --pages 1-10. To filter small icons, use --min-image-size 100.
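When the extraction is launched from Python rather than typed by hand, the optional flags above can be assembled as an argument list. This is a minimal sketch: `build_extract_command` is a hypothetical helper, and it uses only the flags named in this section (`--method`, `--pages`, `--min-image-size`).

```python
from typing import List, Optional

# Script path as given in Step 1.
SCRIPT = "/mnt/skills/user/extracting-pdfs/scripts/extract_pdf.py"


def build_extract_command(
    pdf_path: str,
    out_dir: str,
    method: Optional[str] = None,
    pages: Optional[str] = None,
    min_image_size: Optional[int] = None,
) -> List[str]:
    """Return the argv list for one extraction run (hypothetical helper)."""
    cmd = ["python", SCRIPT, pdf_path, out_dir]
    if method is not None:
        cmd += ["--method", method]
    if pages is not None:
        cmd += ["--pages", pages]
    if min_image_size is not None:
        cmd += ["--min-image-size", str(min_image_size)]
    return cmd


if __name__ == "__main__":
    # Example file name is illustrative only.
    print(" ".join(build_extract_command(
        "/mnt/user-data/uploads/report.pdf",
        "/home/claude/extracted/",
        pages="1-10",
        min_image_size=100,
    )))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell-quoting problems with file names that contain spaces.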

Output:

/home/claude/extracted/
├── {filename}.md      # Raw markdown with YAML frontmatter
├── metadata.json      # Structured metadata
└── images/            # Extracted images (if any)

Step 2: Analyse

Read the extracted markdown:

cat /home/claude/extracted/{filename}.md

Check YAML frontmatter for:

  • extraction_method — Which extractor was used
  • total_pages — Document length
  • has_outline — Bookmarks exist (helps with structure)
  • total_images — Number of images
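As an alternative to scanning the frontmatter by eye, the fields above can be read with a few lines of Python. This sketch assumes the frontmatter is a flat block of `key: value` lines between two `---` delimiters. The real file may contain nested YAML, which this deliberately does not handle. `read_frontmatter` and the sample values are illustrative, not part of the skill.

```python
from typing import Dict


def read_frontmatter(markdown: str) -> Dict[str, str]:
    """Parse a flat `key: value` frontmatter block into a dict of strings."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as no frontmatter


if __name__ == "__main__":
    # Sample values are made up for the demonstration.
    sample = (
        "---\n"
        "extraction_method: pymupdf4llm\n"
        "total_pages: 12\n"
        "has_outline: true\n"
        "total_images: 3\n"
        "---\n"
        "# Raw content\n"
    )
    for name, value in read_frontmatter(sample).items():
        print(f"{name} = {value}")
```

Values are returned as strings, so `total_pages` needs an explicit `int(...)` before any arithmetic.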

Identify issues requiring cleanup:

  • Repeated footers/headers on every page
  • Watermarks, branding, page numbers
  • Fragmented sentences across line breaks
  • Malformed tables
  • Image markers needing repositioning
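Repeated headers and footers are often easiest to spot by counting identical lines. The sketch below only reports candidates and changes nothing, so the cleanup itself stays manual as Step 3 requires. `repeated_lines` and the `min_count` threshold are assumptions for illustration.

```python
from collections import Counter
from typing import List, Tuple


def repeated_lines(markdown: str, min_count: int = 3) -> List[Tuple[str, int]]:
    """List non-empty lines occurring at least min_count times, most frequent first."""
    counts = Counter(
        line.strip() for line in markdown.splitlines() if line.strip()
    )
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Made-up extract with a footer repeated on every page.
    raw = (
        "Acme Corp Confidential\nIntroduction\n"
        "Acme Corp Confidential\nMethods\n"
        "Acme Corp Confidential\nResults\n"
    )
    for text, n in repeated_lines(raw):
        print(f"{n}x  {text}")
```

Legitimate content, such as a recurring table header, can also show up in this list, so each hit still needs a human decision.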

Step 3: Clean

Manual cleanup only. Do not write scripts, sed/awk commands, or regex replacements. Read the content and write a clean version directly.

Process:

  1. Read the extracted markdown completely
  2. Identify noise patterns (footers, headers, branding, page numbers)
  3. Write clean output directly, omitting noise as you go

Load references as needed:

Repeated elements & source-specific patterns: See cleanup-patterns.md

  • Use when: footers, headers, SME/PMT branding detected

Text fragmentation: See sentence-reflow.md

  • Use when: sentences split across lines or pages

Table issues: See table-formatting.md

  • Use when: tables have missing delimiters, broken structure

Image handling: See image-handling.md

  • Use when: document contains images to process

Step 4: Organise

While writing the clean output, apply these formatting principles:

Heading Hierarchy

  • Use #, ##, ### in order, consistently
  • Don't skip levels
  • Remove redundant numbering if using markdown headers
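The "don't skip levels" rule can be verified on the finished file. This is a report-only sketch: it assumes ATX-style headings (`#` through `######`) and ignores anything inside triple-backtick fences. `skipped_heading_levels` is a hypothetical name.

```python
import re
from typing import List, Tuple

# An ATX heading: one to six '#' characters followed by whitespace and text.
HEADING = re.compile(r"^(#{1,6})\s+\S")


def skipped_heading_levels(markdown: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for headings that jump more than one level deeper."""
    problems: List[Tuple[int, str]] = []
    previous_level = 0
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on every fence line
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if level > previous_level + 1:
            problems.append((number, line))
        previous_level = level
    return problems


if __name__ == "__main__":
    doc = "# Title\n\n### Too deep\n\n## Fine\n"
    for number, line in skipped_heading_levels(doc):
        print(f"line {number}: {line}")
```

Because the starting level is treated as 0, a document whose first heading is `##` is also reported.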

Paragraph Flow

  • Single blank line between paragraphs
  • Remove orphan lines (single words alone)
  • Merge related short paragraphs

Image Placement

Convert markers to proper markdown:

<!-- Before -->
<!-- IMAGE: images/page003_img001.png (450x280px) -->

<!-- After -->
![Figure 1: Description](./images/page003_img001.png)

View each image with the view tool so the alt text describes what it actually shows.
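To know which images need viewing, the markers can be listed first. The sketch assumes every marker follows the exact shape of the single example above (`<!-- IMAGE: path (WxHpx) -->`); other shapes would be missed. `find_image_markers` and `to_markdown_image` are hypothetical helpers, and the alt text is still written by hand after viewing each image.

```python
import re
from typing import List, NamedTuple


class ImageMarker(NamedTuple):
    path: str
    width: int
    height: int


# Matches e.g. <!-- IMAGE: images/page003_img001.png (450x280px) -->
MARKER = re.compile(r"<!--\s*IMAGE:\s*(\S+)\s*\((\d+)x(\d+)px\)\s*-->")


def find_image_markers(markdown: str) -> List[ImageMarker]:
    """Return every image marker found, in document order."""
    return [
        ImageMarker(m.group(1), int(m.group(2)), int(m.group(3)))
        for m in MARKER.finditer(markdown)
    ]


def to_markdown_image(marker: ImageMarker, alt_text: str) -> str:
    """Format one marker as a markdown image with hand-written alt text."""
    return f"![{alt_text}](./{marker.path})"


if __name__ == "__main__":
    raw = "<!-- IMAGE: images/page003_img001.png (450x280px) -->"
    for marker in find_image_markers(raw):
        print(marker)
        print(to_markdown_image(marker, "Figure 1: Description"))
```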

Step 5: Output

Write Clean File

After reading and mentally processing the extracted content, write the clean markdown directly to a file:

# Write clean content to file (Claude creates this content)
cat > /mnt/user-data/outputs/{filename}_clean.md << 'EOF'
# Document Title

[Clean content goes here - written by Claude, not copied]

EOF

Or use the create_file tool to write the clean content directly.

Copy Images (if applicable)

mkdir -p /mnt/user-data/outputs/images/
cp -r /home/claude/extracted/images/* /mnt/user-data/outputs/images/

Quality Check

Review the clean output against this checklist. If issues are found, fix them and re-check:

Quality Checklist:
- [ ] No repeated footers/headers
- [ ] No standalone page numbers
- [ ] No watermarks or branding
- [ ] Sentences properly rejoined
- [ ] Tables intact and readable
- [ ] Images converted to markdown syntax
- [ ] Heading hierarchy logical

If any item fails, revise the content and verify again before delivering.
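Two checklist items lend themselves to a mechanical second look: standalone page numbers and image markers that were never converted. The sketch below reports line numbers only and does not edit the file. The page-number pattern (a line holding nothing but a 1 to 4 digit number, optionally prefixed by "Page") is an assumption and will produce false positives, for example a year on its own line.

```python
import re
from typing import Dict, List

# A line containing only a short number, optionally prefixed by "Page".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)
# Start of a raw extraction marker that should have become a markdown image.
IMAGE_MARKER = re.compile(r"<!--\s*IMAGE:")


def checklist_findings(markdown: str) -> Dict[str, List[int]]:
    """Return 1-based line numbers of likely leftovers, grouped by checklist item."""
    findings: Dict[str, List[int]] = {
        "standalone_page_numbers": [],
        "unconverted_image_markers": [],
    }
    for number, line in enumerate(markdown.splitlines(), start=1):
        if PAGE_NUMBER.match(line):
            findings["standalone_page_numbers"].append(number)
        if IMAGE_MARKER.search(line):
            findings["unconverted_image_markers"].append(number)
    return findings


if __name__ == "__main__":
    clean = "# Title\n\nSome text.\n\n12\n"
    print(checklist_findings(clean))
```

The remaining items (branding, sentence reflow, table readability) need a read-through and are not covered here.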

Summary to User

Include:

  • Pages extracted
  • What was cleaned (types of noise removed)
  • Images included (remind the user that the images/ folder must stay alongside the markdown file so the ./images/ links resolve)
  • Any limitations noted

Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| "File not found" | Wrong path | Check /mnt/user-data/uploads/ |
| "Invalid PDF header" | Not a PDF | Inform user file is invalid |
| "Extraction failed" | Protected/corrupted | Try --method pymupdf |
| Empty output | Scanned PDF | Inform user, suggest OCR |
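The first two rows can be caught before extraction starts. PDF files begin with the bytes `%PDF-`, so a quick check of the path and the first five bytes is enough. This is a sketch under assumptions: `preflight` is a hypothetical helper, whether `extract_pdf.py` performs the same check internally is not stated here, and the test is stricter than some PDF readers, which tolerate a few bytes before the header.

```python
import tempfile
from pathlib import Path


def preflight(pdf_path: str) -> str:
    """Return "ok", "File not found", or "Invalid PDF header" for a given path."""
    path = Path(pdf_path)
    if not path.is_file():
        return "File not found"
    with path.open("rb") as handle:
        if handle.read(5) != b"%PDF-":
            return "Invalid PDF header"
    return "ok"


if __name__ == "__main__":
    # Demonstrate all three outcomes with throwaway files.
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "good.pdf"
        good.write_bytes(b"%PDF-1.7\n")
        bad = Path(tmp) / "bad.pdf"
        bad.write_bytes(b"plain text, not a PDF")
        for candidate in (good, bad, Path(tmp) / "missing.pdf"):
            print(candidate.name, "->", preflight(str(candidate)))
```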

Special Cases

Scanned/Image PDFs

If extraction_method shows pymupdf (fallback) with minimal text:

  • PDF is likely scanned/image-based
  • Inform user OCR tools may be needed

Large Documents (50+ pages)

Consider extracting in ranges:

python extract_pdf.py doc.pdf ./out1/ --pages 1-25
python extract_pdf.py doc.pdf ./out2/ --pages 26-50
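For documents of other lengths, the `--pages` values can be computed rather than written out by hand. A minimal sketch: `page_ranges` is a hypothetical helper, the chunk size of 25 simply mirrors the example above, and it assumes `--pages` accepts inclusive `start-end` ranges as shown there.

```python
from typing import List


def page_ranges(total_pages: int, chunk_size: int = 25) -> List[str]:
    """Split pages 1..total_pages into inclusive "start-end" strings."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        f"{start}-{min(start + chunk_size - 1, total_pages)}"
        for start in range(1, total_pages + 1, chunk_size)
    ]


if __name__ == "__main__":
    # Print one command per chunk, each with its own output folder.
    for index, pages in enumerate(page_ranges(60), start=1):
        print(f"python extract_pdf.py doc.pdf ./out{index}/ --pages {pages}")
```

`total_pages` can be taken from the frontmatter of a first extraction pass.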

Multi-Column Layouts

Verify that the reading order makes sense. pymupdf4llm handles columns reasonably well but may interleave text from adjacent columns.

Output Format

Final markdown structure:

# {Document Title}

## {First Section}

{Clean content...}

## {Second Section}

{Clean content...}

---

*Source: {filename}.pdf | Extracted: {date}*
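The layout above can also be assembled programmatically once the section text has been written by hand. This sketch only arranges content supplied by the caller. `render_document` is a hypothetical helper, and the ISO date format for `{date}` is an assumption, since the template does not specify one.

```python
from datetime import date
from typing import List, Tuple


def render_document(
    title: str,
    sections: List[Tuple[str, str]],
    source_filename: str,
    extracted: date,
) -> str:
    """Arrange a title, (heading, body) sections, and the source footer."""
    parts = [f"# {title}", ""]
    for heading, body in sections:
        parts += [f"## {heading}", "", body.strip(), ""]
    parts += [
        "---",
        "",
        f"*Source: {source_filename} | Extracted: {extracted.isoformat()}*",
    ]
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    print(render_document(
        "Document Title",
        [("First Section", "Clean content..."), ("Second Section", "Clean content...")],
        "report.pdf",
        date.today(),
    ))
```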

Source

SKILL.md on GitHub: https://github.com/maaarcooo/claude-skills/blob/main/extracting-pdfs/SKILL.md

Overview

Converts uploaded PDFs into clean Markdown by extracting text, images, and metadata. It removes noise like footers, watermarks, and page numbers, while reorganizing fragmented content into a coherent structure.

How This Skill Works

Follows a 5-step workflow: extract raw content and metadata, analyze for issues, manually clean noise, organize formatting, and output clean Markdown. For scanned PDFs or problematic extractions it offers an alternative extraction method (--method pymupdf), though image-only PDFs that yield little or no text may still need OCR. Extracted images and key metadata are preserved to guide the final structure.

When to Use It

  • When a user uploads a PDF and needs a clean, readable Markdown export for notes or knowledge bases.
  • When repeated footers, headers, watermarks, or branding clutter must be removed to improve readability.
  • When the document includes images that should be extracted and placed inline in Markdown.
  • When content is fragmented across line breaks or pages and requires sentence reflow.
  • When you want a structured output with metadata (e.g., total_pages, has_outline, total_images) for documentation workflows.

Quick Start

  1. Run extraction to produce raw Markdown and metadata from the uploaded PDF.
  2. Review YAML frontmatter and identify noise like footers and fragmented sentences.
  3. Manually rewrite to a clean Markdown document and format images inline.

Best Practices

  • Run the extraction to generate raw Markdown plus metadata before cleaning.
  • Check YAML frontmatter fields (extraction_method, total_pages, has_outline, total_images) to guide cleanup.
  • Identify noise patterns (footers, headers, branding, page numbers) and plan their removal manually.
  • Avoid automated regex or scripting during cleanup; rewrite clean content directly.
  • Consult reference patterns (cleanup-patterns.md, sentence-reflow.md, table-formatting.md, image-handling.md) when dealing with specific issues.

Example Use Cases

  • Convert a university research paper into a tidy Markdown document for a course wiki.
  • Process annual reports to remove branding noise and publish in an internal knowledge base.
  • Digitize technical manuals, preserving diagrams as images with accurate alt text.
  • Archive legal PDFs by producing clean Markdown with clear sectioning and bookmarks.
  • Publish e-books as Markdown for a content platform, ensuring cohesive paragraphs and tables.
