What is pdf-extractor used for?

To extract text and structured data from PDFs using multiple backends with automatic fallback, including OCR for scanned documents.

Which backends are supported?

CPU backends include markitdown, pdfplumber, pdfminer, and pypdf2; GPU backends include docling and marker, with additional options described in the backend guide.

How do I customize backends or view options?

Use --backends to set the extraction order and --list-backends to see available options and GPU status.

pdf-extractor

npx machina-cli add skill ahundt/autorun/pdf-extractor --openclaw

PDF Data Extraction

Extract text and structured data from PDF documents using a multi-backend approach with automatic fallback.

Overview

This skill provides PDF text extraction with 9 different backends, automatic GPU detection, and intelligent backend selection. The extraction system tries backends in order until one succeeds, producing markdown output optimized for further processing.
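The try-in-order behaviour described above can be sketched in a few lines of Python. This is a minimal illustration and not the skill's actual implementation: `extract_with_fallback` and the two stub backends are hypothetical names, and a real backend would parse a PDF instead of returning a canned string.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for real backends: each takes a path and returns
# text, or raises if it cannot handle the file.
def failing_backend(path: str) -> str:
    raise RuntimeError("backend cannot parse this file")

def working_backend(path: str) -> str:
    return f"# Extracted from {path}\n"

def extract_with_fallback(
    path: str,
    backends: List[Tuple[str, Callable[[str], str]]],
) -> Dict[str, object]:
    """Try each backend in order and stop at the first one that yields text."""
    errors = []
    for name, extract in backends:
        try:
            text = extract(path)
        except Exception as exc:  # a failed backend must not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():  # empty output counts as a failure, so keep going
            return {"success": True, "backend_used": name, "text": text, "error": None}
        errors.append(f"{name}: empty output")
    return {"success": False, "backend_used": None, "text": None, "error": "; ".join(errors)}

result = extract_with_fallback(
    "document.pdf",
    [("first", failing_backend), ("second", working_backend)],
)
print(result["backend_used"])
```

Treating empty output as a failure is an assumption of this sketch; it lets a later backend take over when an earlier one returns nothing useful.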

Quick Start Workflow

To extract text from PDFs:

Single file extraction (installed CLI - recommended):
```
extract-pdfs /path/to/document.pdf
```
Output: Creates document.md in the same directory.

Batch extraction (directory):
```
extract-pdfs /path/to/pdfs/ /path/to/output/
```
Output: Creates .md files for all PDFs in output directory.

Custom output file:
```
extract-pdfs document.pdf output.md
```

Specific backends:
```
extract-pdfs document.pdf --backends markitdown pdfplumber
```

List available backends:
```
extract-pdfs --list-backends
```
Output: Shows available backends and GPU status.

Alternative Execution Methods

If the extract-pdfs CLI isn't installed, install it first (recommended):

```
# Install as global UV tool (from repo root):
cd "${CLAUDE_PLUGIN_ROOT}/../.." && uv tool install --force --editable plugins/pdf-extractor
extract-pdfs --list-backends  # verify
```

Or use these fallback methods without installing:

```
# uv run (recommended fallback; no install required):
uv run --project "${CLAUDE_PLUGIN_ROOT}" python -m pdf_extraction document.pdf

# Standalone script execution
python "${CLAUDE_PLUGIN_ROOT}/src/pdf_extraction/cli.py" document.pdf
```
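When a script has to choose between these invocation methods, the choice can be automated. The sketch below is one possible approach; `build_extract_command` and `detect_and_build` are hypothetical helpers, not part of the skill. It prefers the installed CLI and otherwise falls back to the `uv run` form shown above.

```python
import os
import shutil
from typing import List, Optional

def build_extract_command(
    pdf_path: str,
    cli_path: Optional[str],
    plugin_root: Optional[str],
) -> List[str]:
    """Return an argv list for extracting one PDF.

    Prefers the installed CLI; otherwise uses `uv run` against the plugin
    directory, mirroring the fallback command in this section.
    """
    if cli_path:
        return [cli_path, pdf_path]
    if plugin_root:
        return ["uv", "run", "--project", plugin_root,
                "python", "-m", "pdf_extraction", pdf_path]
    raise RuntimeError("extract-pdfs is not installed and CLAUDE_PLUGIN_ROOT is unset")

def detect_and_build(pdf_path: str) -> List[str]:
    """Look up the CLI on PATH and the plugin root in the environment."""
    return build_extract_command(
        pdf_path,
        shutil.which("extract-pdfs"),
        os.environ.get("CLAUDE_PLUGIN_ROOT"),
    )

# Example with explicit values (the plugin path is a placeholder):
print(build_extract_command("document.pdf", None, "/path/to/plugin"))
```

The returned list can be passed to `subprocess.run(command, check=True)`.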

Backend Selection Guide

Custom Backend Ordering

Specify backends in any order with --backends. The system tries each in order, stopping on first success:

```
# Tables first, then general extraction
extract-pdfs document.pdf --backends pdfplumber markitdown pdfminer

# Scanned documents: vision-based first
extract-pdfs scanned.pdf --backends marker docling markitdown

# Most permissive fallback order (handles problematic PDFs)
extract-pdfs document.pdf --backends pdfminer pypdf2 markitdown

# Single backend only (no fallback)
extract-pdfs document.pdf --backends markitdown
```

CPU-Only Systems (Default)

For systems without GPU, the recommended backend order:

markitdown - Microsoft's lightweight converter (MIT, fast, no models)
pdfplumber - Excellent for tables (MIT)
pdfminer - Pure Python, reliable (MIT)
pypdf2 - Basic extraction, always available (BSD-3)

GPU Systems

For systems with CUDA-enabled GPU:

docling - IBM layout analysis (MIT, ~500MB models)
marker - Vision-based, best for scanned docs (GPL-3.0, ~1GB models)
Plus all CPU backends as fallback
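The CPU and GPU orderings above can be combined into a single rule. The following sketch shows one way to do that; it is not how `detect_gpu_availability()` is necessarily implemented, and `cuda_available` and `recommended_order` are hypothetical names. The CUDA check assumes PyTorch is the detection mechanism and returns False when it is not installed.

```python
from typing import List

CPU_BACKENDS = ["markitdown", "pdfplumber", "pdfminer", "pypdf2"]
GPU_BACKENDS = ["docling", "marker"]

def cuda_available() -> bool:
    """Best-effort CUDA check; False when PyTorch is missing or fails to load."""
    try:
        import torch
        return bool(torch.cuda.is_available())
    except Exception:
        return False

def recommended_order(has_gpu: bool) -> List[str]:
    """GPU backends first when a GPU is present; CPU backends always follow."""
    return (GPU_BACKENDS if has_gpu else []) + CPU_BACKENDS

print(recommended_order(cuda_available()))
```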

Backend Comparison

| Backend     | License    | Models | Best For                | Speed  |
|-------------|------------|--------|-------------------------|--------|
| markitdown  | MIT        | None   | General text, forms     | Fast   |
| pdfplumber  | MIT        | None   | Tables, structured data | Fast   |
| pdfminer    | MIT        | None   | Simple text documents   | Fast   |
| pypdf2      | BSD-3      | None   | Basic extraction        | Fast   |
| docling     | MIT        | ~500MB | Layout analysis         | Medium |
| marker      | GPL-3.0    | ~1GB   | Scanned documents       | Slow   |
| pymupdf4llm | AGPL-3.0   | None   | LLM-optimized output    | Fast   |
| pdfbox      | Apache-2.0 | None   | Tables (Java-based)     | Medium |
| pdftotext   | System     | None   | Simple text (CLI)       | Fast   |

Backend Decision Matrix

| Document Type              | Recommended Backend(s) | Why                  |
|----------------------------|------------------------|----------------------|
| Digital text PDF (default) | markitdown, pdfplumber | Fast, accurate       |
| PDF with tables/invoices   | pdfplumber, pdfbox     | Best table structure |
| Complex layouts/columns    | docling (GPU)          | Layout analysis      |
| Scanned documents/images   | marker, docling (GPU)  | OCR/vision required  |
| Insurance policies/forms   | markitdown, pdfplumber | Handles form fields  |
| Academic papers            | docling                | Equations, figures   |
| Maximum compatibility      | pdfminer, pypdf2       | Fewest dependencies  |
| Commercial use required    | markitdown, pdfplumber | MIT license          |
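The decision matrix can be encoded as a lookup table so that scripts pick a backend order by document type. The sketch below is illustrative only: the dictionary keys and the `cli_args` helper are hypothetical labels, while the backend lists are copied from the matrix.

```python
from typing import Dict, List

# Backend lists come from the decision matrix; the keys are made-up labels.
RECOMMENDED: Dict[str, List[str]] = {
    "digital_text": ["markitdown", "pdfplumber"],
    "tables": ["pdfplumber", "pdfbox"],
    "complex_layout": ["docling"],
    "scanned": ["marker", "docling"],
    "forms": ["markitdown", "pdfplumber"],
    "academic": ["docling"],
    "max_compatibility": ["pdfminer", "pypdf2"],
    "commercial": ["markitdown", "pdfplumber"],
}

def cli_args(pdf_path: str, doc_type: str) -> List[str]:
    """Build an extract-pdfs argv list; unknown types fall back to digital text."""
    backends = RECOMMENDED.get(doc_type, RECOMMENDED["digital_text"])
    return ["extract-pdfs", pdf_path, "--backends", *backends]

print(" ".join(cli_args("invoice.pdf", "tables")))
```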

Programmatic Usage

To use the extraction library directly in Python code:

```
from pdf_extraction import extract_single_pdf, pdf_to_txt, detect_gpu_availability

# Check available backends
gpu_info = detect_gpu_availability()
print(f"Recommended backends: {gpu_info['recommended_backends']}")

# Extract single file
result = extract_single_pdf(
    input_file='/path/to/document.pdf',
    output_file='/path/to/output.md',
    backends=['markitdown', 'pdfplumber']
)

if result['success']:
    print(f"Extracted with {result['backend_used']}")
    print(f"Quality metrics: {result['quality_metrics']}")

# Batch extract directory
output_files, metadata = pdf_to_txt(
    input_dir='/path/to/pdfs/',
    output_dir='/path/to/output/',
    resume=True,  # Skip already-extracted files
    return_metadata=True
)
```

Extraction Metadata

Every extraction returns metadata for quality assessment:

```
{
    'success': True,
    'backend_used': 'markitdown',
    'extraction_time_seconds': 2.5,
    'output_size_bytes': 15234,
    'quality_metrics': {
        'char_count': 15234,
        'line_count': 450,
        'word_count': 2800,
        'table_markers': 12,      # Count of | (tables)
        'has_structure': True     # Has markdown structure
    },
    'encrypted': False,
    'error': None
}
```
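To make the `quality_metrics` fields concrete, here is a rough sketch of how such values could be computed from extracted markdown. `compute_quality_metrics` is a hypothetical function and may differ from the code in utils.py; in particular, the rule used for `has_structure` (at least one heading or table marker) is an assumption.

```python
import re
from typing import Dict, Union

def compute_quality_metrics(markdown: str) -> Dict[str, Union[int, bool]]:
    """Compute metrics shaped like the quality_metrics dictionary."""
    lines = markdown.splitlines()
    has_heading = any(re.match(r"#{1,6}\s", line) for line in lines)
    table_markers = markdown.count("|")  # every pipe character is counted
    return {
        "char_count": len(markdown),
        "line_count": len(lines),
        "word_count": len(markdown.split()),
        "table_markers": table_markers,
        # Assumption: structure means a heading or at least one table marker.
        "has_structure": has_heading or table_markers > 0,
    }

sample = "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
metrics = compute_quality_metrics(sample)
print(metrics)
```

A very low `word_count` relative to the page count is a common hint that a PDF is scanned and needs an OCR-capable backend.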

Handling Common Scenarios

Encrypted PDFs

The system detects encrypted PDFs and reports them:

```
if result['encrypted']:
    print("PDF is password-protected")
```

Encrypted PDFs cannot be extracted without the password.

Empty or Failed Extractions

When all backends fail:

Check if PDF is encrypted
Try with --backends pdfminer pypdf2 (most permissive)
Check PDF isn't corrupted
Consider OCR-based backends for scanned documents
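The checklist above can be turned into a small triage helper that reads the result dictionary. This is a sketch under the assumption that the result carries the `success` and `encrypted` keys shown in Extraction Metadata; `next_steps` is a hypothetical name.

```python
from typing import Dict, List

def next_steps(result: Dict[str, object]) -> List[str]:
    """Map a result dictionary to follow-up actions from the checklist."""
    if result.get("success"):
        return []
    if result.get("encrypted"):
        # No backend can help until the document is unlocked.
        return ["Supply the password or obtain an unprotected copy"]
    return [
        "Retry with --backends pdfminer pypdf2",
        "Verify that the PDF is not corrupted",
        "For scanned documents, retry with --backends marker docling",
    ]

failed = {"success": False, "encrypted": False, "error": "all backends failed"}
for step in next_steps(failed):
    print("-", step)
```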

Resume Batch Processing

To continue interrupted batch extraction:

```
extract-pdfs /path/to/pdfs/ /path/to/output/
```

The resume=True default skips already-extracted files.

To force re-extraction:

```
extract-pdfs /path/to/pdfs/ --no-resume
```
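Resume logic of this kind usually comes down to checking which outputs already exist. The sketch below illustrates the idea with a hypothetical `pending_pdfs` helper; it assumes a PDF counts as done when a non-empty `<stem>.md` file exists in the output directory, which may not match the skill's exact rule.

```python
import tempfile
from pathlib import Path
from typing import List

def pending_pdfs(input_dir: Path, output_dir: Path, resume: bool = True) -> List[Path]:
    """List PDFs that still need extraction.

    Assumption: a PDF is done when a non-empty <stem>.md exists in output_dir.
    """
    pending = []
    for pdf in sorted(input_dir.glob("*.pdf")):
        target = output_dir / f"{pdf.stem}.md"
        done = target.exists() and target.stat().st_size > 0
        if resume and done:
            continue  # skip work that has already been completed
        pending.append(pdf)
    return pending

# Demonstration in a throwaway directory: a.pdf is already extracted.
with tempfile.TemporaryDirectory() as tmp:
    src, out = Path(tmp, "pdfs"), Path(tmp, "output")
    src.mkdir()
    out.mkdir()
    (src / "a.pdf").touch()
    (src / "b.pdf").touch()
    (out / "a.md").write_text("# already extracted\n")
    remaining = [p.name for p in pending_pdfs(src, out)]
    everything = [p.name for p in pending_pdfs(src, out, resume=False)]

print(remaining, everything)
```

Requiring a non-empty output file means a run that crashed after creating an empty file is retried instead of being skipped.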

Tables and Structured Data

For PDFs with tables, prioritize:

```
extract-pdfs document.pdf --backends pdfplumber markitdown
```

The output will contain markdown tables when detected:

| Column1 | Column2 | Column3 |
|---------|---------|---------|
| Data    | Data    | Data    |
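Downstream code often needs these tables as data instead of text. The following sketch parses simple pipe tables from the extracted markdown into row dictionaries. `parse_markdown_tables` is a hypothetical helper, and it deliberately ignores edge cases such as escaped pipes inside cells.

```python
from typing import Dict, List

def parse_markdown_tables(markdown: str) -> List[List[Dict[str, str]]]:
    """Parse pipe tables into lists of row dictionaries keyed by header."""
    tables: List[List[Dict[str, str]]] = []
    block: List[List[str]] = []
    for line in markdown.splitlines() + [""]:  # sentinel flushes the last table
        stripped = line.strip()
        if stripped.startswith("|"):
            block.append([cell.strip() for cell in stripped.strip("|").split("|")])
            continue
        # A table needs a header row followed by a separator row of dashes.
        if len(block) >= 2 and all(c and set(c) <= set("-: ") for c in block[1]):
            header = block[0]
            tables.append([dict(zip(header, row)) for row in block[2:]])
        block = []
    return tables

sample = (
    "| Column1 | Column2 | Column3 |\n"
    "|---------|---------|---------|\n"
    "| Data    | Data    | Data    |\n"
)
tables = parse_markdown_tables(sample)
print(tables)
```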

Module Structure Reference

Source Code Layout

Location: ${CLAUDE_PLUGIN_ROOT}/src/pdf_extraction/

| File            | Purpose                                                  |
|-----------------|----------------------------------------------------------|
| `__init__.py`   | Package exports (extract_single_pdf, pdf_to_txt, etc.)   |
| `__main__.py`   | Support for `python -m pdf_extraction`                   |
| `cli.py`        | CLI entry point with argparse                            |
| `backends.py`   | BackendExtractor base class + 9 backend implementations  |
| `extractors.py` | extract_single_pdf(), pdf_to_txt() functions             |
| `utils.py`      | GPU detection, quality metrics, encryption check         |

Key Classes and Functions

| Component                   | Location             | Purpose                                     |
|-----------------------------|----------------------|---------------------------------------------|
| `BackendExtractor`          | backends.py:35-123   | Base class with Template Method pattern     |
| `DoclingExtractor`          | backends.py:130-142  | IBM Docling backend (MIT, GPU)              |
| `MarkerExtractor`           | backends.py:145-158  | Vision-based marker backend (GPL-3.0, GPU)  |
| `MarkItDownExtractor`       | backends.py:161-173  | Microsoft MarkItDown (MIT, CPU)             |
| `PdfplumberExtractor`       | backends.py:244-253  | Table-focused extraction (MIT)              |
| `PdfminerExtractor`         | backends.py:219-226  | Pure Python fallback (MIT)                  |
| `Pypdf2Extractor`           | backends.py:229-241  | Basic extraction, always available (BSD-3)  |
| `BACKEND_REGISTRY`          | backends.py:279-292  | Dict mapping backend names to factories     |
| `detect_gpu_availability()` | utils.py:9-40        | Auto-detect GPU and recommend backends      |
| `extract_single_pdf()`      | extractors.py:13-80  | Extract one PDF with backend fallback       |
| `pdf_to_txt()`              | extractors.py:83-170 | Batch extract directory with resume         |

Key implementation details:

Backend fallback loop: extractors.py:55-78 - Tries each backend in order, stops on first success
Lazy initialization: backends.py:77-79 - Converters created only when first used
Quality metrics: utils.py:43-76 - Calculates char/word/table counts
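The lazy-initialization and registry ideas listed above can be illustrated with a short sketch. This is not the code in backends.py: `LazyBackend`, `make_stub_converter`, and `REGISTRY` are hypothetical names, and the stub converter stands in for an expensive model or parser.

```python
from typing import Callable, Dict, List, Optional

class LazyBackend:
    """Defers building an expensive converter until the first extract() call."""

    def __init__(self, name: str, factory: Callable[[], Callable[[str], str]]):
        self.name = name
        self._factory = factory
        self._converter: Optional[Callable[[str], str]] = None

    def extract(self, path: str) -> str:
        if self._converter is None:  # build on first use only
            self._converter = self._factory()
        return self._converter(path)

created: List[str] = []

def make_stub_converter() -> Callable[[str], str]:
    created.append("stub")  # records each (costly) construction
    return lambda path: f"text from {path}"

# Hypothetical registry: backend name -> zero-argument factory.
REGISTRY: Dict[str, Callable[[], LazyBackend]] = {
    "stub": lambda: LazyBackend("stub", make_stub_converter),
}

backend = REGISTRY["stub"]()  # nothing is constructed at this point
backend.extract("a.pdf")
backend.extract("b.pdf")
print(len(created))
```

Deferring construction matters most for backends that load large models, since a backend that is never reached in the fallback order never pays that cost.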

Additional Resources

Reference Files

For detailed backend documentation and advanced patterns:

references/backends.md - Detailed backend comparison and selection guide

Example Usage

Working examples in the insurance analysis that prompted this skill:

Extracted 21 PDFs from mortgage statements and insurance policies
Used markitdown backend for fast extraction
Parsed structured data (dates, amounts, policy numbers)

Error Handling

The extraction system handles errors gracefully:

Backend failures: Automatically tries next backend
Import errors: Skips unavailable backends
File errors: Reports specific error message
Partial success: Continues with remaining files in batch

All errors are captured in metadata rather than raising exceptions.
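The pattern of capturing errors in metadata can be sketched as follows. `extract_batch` and `stub_extract` are hypothetical names used for illustration; the real functions return richer metadata, as shown in Extraction Metadata.

```python
from typing import Callable, Dict, List

def extract_batch(
    paths: List[str],
    extract: Callable[[str], str],
) -> Dict[str, Dict[str, object]]:
    """Run `extract` over every path, recording failures instead of raising."""
    metadata: Dict[str, Dict[str, object]] = {}
    for path in paths:
        try:
            text = extract(path)
            metadata[path] = {
                "success": True,
                "output_size_bytes": len(text.encode("utf-8")),
                "error": None,
            }
        except Exception as exc:
            # The failure is stored and the loop moves on to the next file.
            metadata[path] = {
                "success": False,
                "output_size_bytes": 0,
                "error": f"{type(exc).__name__}: {exc}",
            }
    return metadata

def stub_extract(path: str) -> str:
    """Hypothetical extractor that fails on one specific file."""
    if path == "broken.pdf":
        raise ValueError("unreadable xref table")
    return "ok"

report = extract_batch(["good.pdf", "broken.pdf", "other.pdf"], stub_extract)
for path, info in report.items():
    print(path, info["success"], info["error"])
```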

Dependencies

Core dependencies (always available):

pdfminer.six - Pure Python PDF parser
pdfplumber - Table-aware extraction
PyPDF2 - Basic PDF operations
tqdm - Progress bars

Optional dependencies:

markitdown - Microsoft multi-format converter
docling - IBM document processor (GPU-accelerated)
marker-pdf - Vision-based extraction (GPU-accelerated)
pymupdf4llm - LLM-optimized output
pdfbox - Java-based extraction

Install the base dependencies (the core packages plus markitdown):

```
uv pip install "markitdown>=0.1.0" "pdfplumber>=0.10.0" "pdfminer.six>=20221105" "PyPDF2>=3.0.0" tqdm
```

For GPU backends:

```
uv pip install docling marker-pdf
```

Troubleshooting

`extract-pdfs: command not found`

```
# Install as global UV tool from repo root:
cd plugins/pdf-extractor && uv tool install --force --editable . && cd ../..
extract-pdfs --list-backends  # verify
```

`ModuleNotFoundError: No module named 'pdf_extraction'` (or 'markitdown', 'pdfplumber')

```
# Re-install with all base dependencies:
cd plugins/pdf-extractor && uv tool install --force --editable . && cd ../..
# Or install explicitly:
uv pip install "markitdown>=0.1.0" "pdfplumber>=0.10.0" "pdfminer.six>=20221105" "PyPDF2>=3.0.0" tqdm
```

GPU backends (docling, marker) not available

```
# Requires PyTorch; install GPU extras:
cd plugins/pdf-extractor && uv tool install --force --editable ".[gpu]" && cd ../..
extract-pdfs --list-backends  # verify gpu backends appear
# Note: docling downloads ~500MB models on first use; marker downloads ~1GB
```

Empty output from scanned PDF (image-only document)

```
# Scanned PDFs require OCR (GPU backends):
extract-pdfs scanned.pdf --backends marker docling
# If GPU unavailable, try pdftotext (system tool). Note that pdftotext does not
# perform OCR; it only helps when the PDF carries an embedded text layer:
brew install poppler        # macOS
# apt install poppler-utils  # Ubuntu/Debian
extract-pdfs scanned.pdf --backends pdftotext
```

pdfminer import error (package name confusion)

```
# Install correct package (name has .six suffix):
uv pip install "pdfminer.six>=20221105"
# Import is still: from pdfminer.high_level import extract_text  (no .six)
```

markitdown version conflict

```
# API changed significantly in 0.1.0; ensure correct version:
uv pip install "markitdown>=0.1.0"
```

Source

https://github.com/ahundt/autorun/blob/main/plugins/pdf-extractor/skills/pdf-extractor/SKILL.md

Overview

PDF Data Extraction uses a multi-backend approach to pull text and structured data from PDFs, with automatic backend selection and optional OCR for scanned documents. It supports single-file and batch extraction and outputs Markdown-optimized text for easy downstream processing. The system can detect GPUs and adjust backend usage accordingly for performance and accuracy.

How This Skill Works

Backends are tried in a configurable order, stopping at the first successful extraction. For scanned PDFs, vision-based backends with OCR can be invoked, while regular PDFs use layout-aware extractors. Outputs are Markdown-formatted text suitable for further processing or analytics.

When to Use It

Single-file PDF text extraction using the recommended CLI.
Batch extraction of all PDFs in a directory, producing one Markdown per file.
OCR-enhanced extraction for image-based or scanned PDFs.
Converting PDFs to Markdown or extracting structured data like tables.
Experimenting with backend orders and GPU-enabled backends to optimize results.

Quick Start

Step 1: extract-pdfs /path/to/document.pdf — outputs document.md in the same directory
Step 2: extract-pdfs /path/to/pdfs/ /path/to/output/ — creates .md files for all PDFs in the directory
Step 3: extract-pdfs document.pdf output.md — custom output file

Best Practices

On systems without a GPU, use the CPU backends in the recommended order: markitdown, pdfplumber, pdfminer, pypdf2.
For scanned documents, elevate marker or docling in the backend order to improve accuracy.
Before large runs, run --list-backends to understand options and GPU status.
Ensure the output directory exists or specify a dedicated output file path.
Test a representative PDF first to tune the backend order for your use case.

Example Use Cases

Extract a single PDF to document.md: extract-pdfs /path/to/document.pdf
Batch extract all PDFs in /path/to/pdfs/ to /path/to/output/ (one .md per PDF)
Prioritize pdfplumber for tables: extract-pdfs document.pdf --backends pdfplumber markitdown
OCR a scanned invoice: extract-pdfs scanned.pdf --backends marker docling markitdown
List available backends and GPU status: extract-pdfs --list-backends
