What is pdf-handling?

A skill that extracts text and images from PDFs using dedicated scripts to produce readable outputs.

Which files should I read after extraction?

Read the extracted .txt or _unified.md files rather than the original PDF.

pdf-handling

npx machina-cli add skill belumume/claude-skills/pdf-handling --openclaw

Files (1)

SKILL.md (314 B)

PDF Extraction

Standard: python "$CLAUDE_PLUGIN_DIR/scripts/pdf_extract.py" "file.pdf"
Unified: python "$CLAUDE_PLUGIN_DIR/scripts/pdf_extract_unified.py" "file.pdf"

Read the extracted .txt or _unified.md, not the PDF.

Source

git clone https://github.com/belumume/claude-skills

SKILL.md: https://github.com/belumume/claude-skills/blob/main/plugins/pdf-guard/skills/pdf-handling/SKILL.md

Overview

pdf-handling converts PDFs into readable text and images by running dedicated extraction scripts. You then work with the .txt or _unified.md outputs rather than the raw PDF, which gives NLP pipelines and documentation workflows a consistent input format.

How This Skill Works

Use either Standard or Unified extraction scripts to convert a PDF into text and image assets. The Standard script is pdf_extract.py; the Unified script is pdf_extract_unified.py. After extraction, read the resulting .txt or _unified.md files, not the original PDF.
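The choice between the two scripts can be sketched in Python as below. This is a minimal illustration, not part of the skill itself: the helper name build_command and its mode and plugin_dir parameters are hypothetical, and only the script names and the CLAUDE_PLUGIN_DIR variable come from the commands shown under PDF Extraction.

```python
import os
import sys
from pathlib import Path

# Script names as given in the skill description.
SCRIPTS = {
    "standard": "pdf_extract.py",
    "unified": "pdf_extract_unified.py",
}


def build_command(pdf_path, mode="standard", plugin_dir=None):
    """Return the argv list for the chosen extractor (hypothetical helper).

    plugin_dir falls back to the CLAUDE_PLUGIN_DIR environment variable,
    mirroring the shell commands in the skill description.
    """
    if mode not in SCRIPTS:
        raise ValueError(f"unknown mode: {mode!r}")
    plugin_dir = plugin_dir or os.environ["CLAUDE_PLUGIN_DIR"]
    script = Path(plugin_dir) / "scripts" / SCRIPTS[mode]
    # Passing a list (rather than a shell string) avoids quoting problems
    # with spaces in the PDF path.
    return [sys.executable, str(script), str(pdf_path)]
```

The returned list can be handed to subprocess.run; an unknown mode raises ValueError rather than silently picking a script.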

When to Use It

You need textual content from a PDF for analysis or summarization.
You want a consistent input format (.txt or _unified.md) for your pipeline.
You need to extract embedded images alongside text for context.
You are preparing literature reviews or knowledge bases from PDFs.
You want to avoid processing raw PDFs directly in your agent workflow.

Quick Start

Step 1: Run the appropriate extractor on your PDF (Standard or Unified).
Step 2: Read the produced .txt or _unified.md file, not the PDF.
Step 3: Use the extracted text/images for further processing.
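The three steps above might look like this when driven from Python. It is a sketch under one loud assumption: that the extractor writes its output next to the PDF as <name>.txt (Standard) or <name>_unified.md (Unified). The source only names the extensions, not the output location, so check where your files actually land. The function names output_path and extract_and_read are hypothetical.

```python
import subprocess
from pathlib import Path


def output_path(pdf_path, mode="standard"):
    """Guess where the extracted file lands.

    ASSUMPTION: outputs sit beside the PDF and reuse its stem; the skill
    description does not confirm this.
    """
    pdf = Path(pdf_path)
    suffix = ".txt" if mode == "standard" else "_unified.md"
    return pdf.with_name(pdf.stem + suffix)


def extract_and_read(command, pdf_path, mode="standard"):
    """Step 1: run the extractor. Step 2: read its output, not the PDF."""
    # check=True raises CalledProcessError if the extractor exits non-zero.
    subprocess.run(command, check=True)
    # Step 3 is up to the caller: the returned text is ready for processing.
    return output_path(pdf_path, mode).read_text(encoding="utf-8")
```

Here command is the argv list for whichever extractor you chose, for example ["python", "<plugin dir>/scripts/pdf_extract.py", "file.pdf"].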

Best Practices

Choose the correct script (Standard or Unified) for your workflow.
Verify extracted text encoding and review _unified.md for structure.
Always read the generated .txt or _unified.md instead of the PDF.
Keep the extraction step reproducible: record the source PDF, the script used, and the output file.
Check for any extraction errors and re-run if needed.
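The error-check and encoding-check practices above can be sketched as two small helpers. Both names (run_with_retry, looks_valid) are hypothetical, and the sketch assumes the extraction scripts signal failure with a non-zero exit code and write UTF-8 text; neither is stated in the skill description.

```python
import subprocess
from pathlib import Path


def run_with_retry(command, attempts=2):
    """Run the extractor, re-running on a non-zero exit code.

    ASSUMPTION: the script reports failure through its exit code.
    Returns True as soon as one attempt succeeds, otherwise False.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        print(f"attempt {attempt} failed: {result.stderr.strip()}")
    return False


def looks_valid(path):
    """True if the file exists, decodes as UTF-8, and is not blank.

    ASSUMPTION: extracted text is UTF-8 encoded.
    """
    try:
        text = Path(path).read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError):
        return False
    return bool(text.strip())
```

A blank or undecodable output is a common sign of a scanned, image-only PDF, so it is worth checking before passing the text downstream.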

Example Use Cases

A researcher converts a batch of journal PDFs to .txt for NLP analysis.
A team processes monthly financial reports into _unified.md for dashboards.
A knowledge-base tool ingests product manuals via extracted text.
An academic archive stores PDFs alongside their text transcripts.
A content team summarizes e-books by extracting text for topic modeling.
