pdf-handling
npx machina-cli add skill belumume/claude-skills/pdf-handling --openclaw
PDF Extraction
Standard: python "$CLAUDE_PLUGIN_DIR/scripts/pdf_extract.py" "file.pdf"
Unified: python "$CLAUDE_PLUGIN_DIR/scripts/pdf_extract_unified.py" "file.pdf"
Read the extracted .txt or _unified.md, not the PDF.
Source
https://github.com/belumume/claude-skills/blob/main/plugins/pdf-guard/skills/pdf-handling/SKILL.md
Overview
pdf-handling converts PDFs into readable text and images by running dedicated extraction scripts. It ensures you work with .txt or _unified.md outputs rather than raw PDFs, simplifying downstream processing. This standardizes input for NLP pipelines and documentation.
How This Skill Works
Use either Standard or Unified extraction scripts to convert a PDF into text and image assets. The Standard script is pdf_extract.py; the Unified script is pdf_extract_unified.py. After extraction, read the resulting .txt or _unified.md files, not the original PDF.
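The choice between the two scripts can be sketched in Python. This is a minimal illustration, not the skill's own code: the script names and the `CLAUDE_PLUGIN_DIR` variable come from the commands above, while `build_command`, `guessed_output`, and the assumption that output is written next to the PDF under the same stem are hypothetical.

```python
import os
import sys
from pathlib import Path

# Script names are taken from the skill text; everything else is an assumption.
SCRIPTS = {"standard": "pdf_extract.py", "unified": "pdf_extract_unified.py"}
OUTPUT_SUFFIX = {"standard": ".txt", "unified": "_unified.md"}


def build_command(pdf_path, mode="standard", plugin_dir=None):
    """Return the argv list for one extraction run (hypothetical helper)."""
    if mode not in SCRIPTS:
        raise ValueError(f"unknown mode: {mode!r}")
    # Fall back to the environment variable used in the commands above.
    plugin_dir = plugin_dir or os.environ.get("CLAUDE_PLUGIN_DIR", "")
    script = Path(plugin_dir) / "scripts" / SCRIPTS[mode]
    return [sys.executable, str(script), str(pdf_path)]


def guessed_output(pdf_path, mode="standard"):
    """Guess where the output lands.

    ASSUMPTION: the scripts write <stem>.txt or <stem>_unified.md beside
    the PDF. Check the scripts' actual behaviour before relying on this.
    """
    if mode not in OUTPUT_SUFFIX:
        raise ValueError(f"unknown mode: {mode!r}")
    pdf = Path(pdf_path)
    return pdf.with_name(pdf.stem + OUTPUT_SUFFIX[mode])
```

A caller could pass the result to `subprocess.run(build_command("file.pdf", "unified"), check=True)` and then read the file at `guessed_output("file.pdf", "unified")` instead of the PDF.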
When to Use It
- You need textual content from a PDF for analysis or summarization.
- You want a consistent input format (.txt or _unified.md) for your pipeline.
- You need to extract embedded images alongside text for context.
- You are preparing literature reviews or knowledge bases from PDFs.
- You want to avoid processing raw PDFs directly in your agent workflow.
Quick Start
- Step 1: Run the appropriate extractor on your PDF (Standard or Unified).
- Step 2: Read the produced .txt or _unified.md file, not the PDF.
- Step 3: Use the extracted text/images for further processing.
Best Practices
- Choose the correct script (Standard or Unified) for your workflow.
- Verify extracted text encoding and review _unified.md for structure.
- Always read the generated .txt or _unified.md instead of the PDF.
- Keep extraction reproducible by recording which source PDF produced which output.
- Check for any extraction errors and re-run if needed.
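The encoding and error checks above might look like the following sketch. The helper name `check_extracted_text` and the `min_chars` threshold are hypothetical, and the code assumes the extracted file is UTF-8, which the source does not state.

```python
from pathlib import Path


def check_extracted_text(path, min_chars=1):
    """Return (ok, reason) for an extracted text file (hypothetical helper)."""
    p = Path(path)
    if not p.is_file():
        return False, "missing output file"
    try:
        # ASSUMPTION: the extractor writes UTF-8; adjust if it does not.
        text = p.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        return False, "not valid UTF-8"
    # Whitespace-only output is treated as a failed extraction.
    if len(text.strip()) < min_chars:
        return False, "empty or near-empty text"
    return True, "ok"
```

A `False` result is a signal to re-run the extractor or inspect the PDF by hand. An empty result can also mean the PDF has no text layer; whether these scripts handle that case is not stated here.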
Example Use Cases
- A researcher converts a batch of journal PDFs to .txt for NLP analysis.
- A team processes monthly financial reports into _unified.md for dashboards.
- A knowledge-base tool ingests product manuals via extracted text.
- An academic archive stores PDFs alongside their text transcripts.
- A content team summarizes e-books by extracting text for topic modeling.
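For batch cases like the first one, a small driver can loop over a folder of PDFs. This is a rough sketch under stated assumptions: `find_pdfs` and `run_batch` are hypothetical helpers, and the code assumes the extraction script exits with a non-zero status on failure, which the source does not confirm.

```python
import subprocess
import sys
from pathlib import Path


def find_pdfs(root):
    """List PDFs under root, sorted so runs process files in a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def run_batch(pdfs, script, runner=subprocess.run):
    """Run the extractor on each PDF and return the PDFs that failed.

    ASSUMPTION: the script signals failure through a non-zero exit code.
    `runner` is injectable so the loop can be exercised without real PDFs.
    """
    failed = []
    for pdf in pdfs:
        result = runner([sys.executable, str(script), str(pdf)])
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Here `script` would be the path to `pdf_extract.py` or `pdf_extract_unified.py` under `$CLAUDE_PLUGIN_DIR/scripts`. Returning the failures rather than raising lets one bad PDF be retried without stopping the rest of the batch.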