PDF Reader (Iyeque)
Verified@iyeque
npx machina-cli add skill @iyeque/iyeque-pdf-reader --openclaw
PDF Reader Skill
The pdf-reader skill extracts text and retrieves metadata from PDF files using PyMuPDF (fitz).
Tool API
The skill provides two commands:
extract
Extracts plain text from the specified PDF file.
- Parameters:
  - file_path (string, required): Path to the PDF file to extract text from.
  - --max_pages (integer, optional): Maximum number of pages to extract.
Usage:
python3 skills/pdf-reader/reader.py extract /path/to/document.pdf
python3 skills/pdf-reader/reader.py extract /path/to/document.pdf --max_pages 5
Output: Plain text content from the PDF.
metadata
Retrieves metadata about the document.
- Parameters:
  - file_path (string, required): Path to the PDF file.
Usage:
python3 skills/pdf-reader/reader.py metadata /path/to/document.pdf
Output: JSON object with PDF metadata including:
- title: Document title
- author: Document author
- subject: Document subject
- creator: Application that created the PDF
- producer: PDF producer
- creationDate: Creation date
- modDate: Modification date
- format: PDF format version
- encryption: Encryption info (if any)
Implementation Notes
- Uses PyMuPDF (imported as pymupdf) for fast, reliable PDF processing
- Supports encrypted PDFs (will return an error if a password is required)
- Handles large PDFs efficiently with the max_pages option
- Returns structured JSON for the metadata command
Example
# Extract text from first 3 pages
python3 skills/pdf-reader/reader.py extract report.pdf --max_pages 3
# Get document metadata
python3 skills/pdf-reader/reader.py metadata report.pdf
# Output:
# {
# "title": "Annual Report 2024",
# "author": "John Doe",
# "creationDate": "D:20240115120000",
# ...
# }
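The creationDate and modDate values use the PDF date string format (D:YYYYMMDDHHmmSS with an optional timezone suffix), which most catalogs will want as a real timestamp. The following is a minimal sketch of one way to convert it using only the standard library; parse_pdf_date is a hypothetical helper name and is not part of the skill.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date strings look like D:YYYYMMDDHHmmSS followed by an optional
# timezone such as Z, +05'30' or -08'00'. Fields after the year are optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime, or return None if it does not parse."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    try:
        tzinfo = None  # no suffix: return a naive datetime
        if sign in ("Z", "z"):
            tzinfo = timezone.utc
        elif sign in ("+", "-"):
            offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
            tzinfo = timezone(offset if sign == "+" else -offset)
        # Missing month/day default to 1, missing time fields to 0.
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        # Out-of-range fields (for example month 13) are treated as unparseable.
        return None
```

Applied to the sample value above, parse_pdf_date("D:20240115120000") yields a naive datetime for 15 January 2024 at 12:00:00. Returning None rather than raising keeps a batch cataloguing job running when one document carries a malformed date.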
Error Handling
- Returns error message if file not found or not a valid PDF
- Returns error if PDF is encrypted and requires password
- Gracefully handles corrupted or malformed PDFs
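A caller can turn these error cases into a simple success-or-error result. The sketch below wraps the metadata command with subprocess; it assumes (this is not stated in the skill's documentation) that a successful run exits with code 0 and prints a JSON object on stdout, and that anything else should be treated as an error message. read_metadata and its run parameter are hypothetical names introduced for illustration.

```python
import json
import subprocess

READER = "skills/pdf-reader/reader.py"  # script path taken from the usage lines


def read_metadata(pdf_path, run=subprocess.run):
    """Run the metadata command and return (metadata_dict, error_message).

    Exactly one of the two returned values is None.
    """
    result = run(
        ["python3", READER, "metadata", str(pdf_path)],
        capture_output=True,
        text=True,
    )
    output = (result.stdout or "").strip()

    # Assumption: a non-zero exit code signals failure.
    if result.returncode != 0:
        return None, (result.stderr or "").strip() or output or "reader failed"

    # Assumption: success prints a JSON object; any other text is an error message.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None, output or "reader returned no output"
    if not isinstance(data, dict):
        return None, output
    return data, None
```

The run parameter exists only so the wrapper can be exercised without a real PDF; in normal use, call read_metadata("report.pdf") and log the second value when the first is None.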
Overview
The PDF Reader uses PyMuPDF to extract plain text and retrieve document metadata from PDFs, so their content can be searched and indexed. It limits work on large files through the max_pages option, returns plain text from extract and a JSON object from metadata, and reports an error for password-protected or corrupted files. This makes it easy to build summaries and quick previews from PDFs.
How This Skill Works
Two commands powered by PyMuPDF drive the functionality: extract returns plain text from a PDF (with an optional --max_pages limit), and metadata returns a JSON object with fields like title, author, subject, creator, producer, creationDate, modDate, format, and encryption. If a PDF is encrypted and requires a password, the tool returns an error instead of output. For large PDFs, --max_pages caps the number of pages processed, which keeps extraction fast.
When to Use It
- Index a folder of PDFs for fast in-app search by extracting text from each file
- Preview a document by pulling only the first N pages with --max_pages
- Catalog PDFs by retrieving metadata for library or archive records
- Build a summarization workflow by passing extracted text to a summarizer or a preprocessing step
- Diagnose or validate PDFs by attempting extract/metadata and handling encryption or corruption errors
Quick Start
- Step 1: Install PyMuPDF (pip install PyMuPDF)
- Step 2: Run extract or metadata on a PDF, e.g. python3 skills/pdf-reader/reader.py extract document.pdf --max_pages 5
- Step 3: Use the output (text or JSON metadata) in your downstream workflow or indexer
Best Practices
- Always specify --max_pages for very large PDFs to avoid heavy processing
- Validate file_path exists and is a real PDF before running commands
- Use metadata output to populate catalogs with fields like title, author, and creationDate
- Gracefully handle errors for encrypted or corrupted PDFs and log the issue
- Store extract output as plain text and metadata output as JSON so downstream tools can consume them directly
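The validation step in the list above can be approximated cheaply before invoking the skill. This is a minimal sketch that checks that the path is a regular file and that a PDF header appears near the start; looks_like_pdf is a hypothetical helper, and passing this check does not guarantee the file is well formed or unencrypted.

```python
from pathlib import Path


def looks_like_pdf(file_path):
    """Return True if file_path is an existing file whose start contains a PDF header."""
    path = Path(file_path)
    if not path.is_file():
        return False
    with path.open("rb") as handle:
        head = handle.read(1024)
    # PDF files begin with a "%PDF-" header. Some writers put a few bytes
    # before it, so the first 1024 bytes are searched rather than offset 0 only.
    return b"%PDF-" in head
```

Files that fail this check can be skipped and logged without starting a reader process, while corrupted or password-protected files that pass it are still left to the skill's own error reporting.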
Example Use Cases
- Extract the first 3 pages of report.pdf: python3 skills/pdf-reader/reader.py extract report.pdf --max_pages 3
- Get metadata for annual_report.pdf: python3 skills/pdf-reader/reader.py metadata annual_report.pdf
- Batch index PDFs by looping through a folder and extracting text for search indexing
- Attempt to extract from encrypted.pdf to observe an encryption error and handle it gracefully
- Preview a long whitepaper by extracting the first 10 pages: python3 skills/pdf-reader/reader.py extract whitepaper.pdf --max_pages 10
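The batch indexing use case can be sketched as a loop over a folder that calls the extract command once per file and saves each result as a .txt file. The function names, the out_dir layout, and the choice to store stdout verbatim are assumptions made for illustration; the skill itself only documents the command line shown above.

```python
import subprocess
from pathlib import Path

READER = "skills/pdf-reader/reader.py"  # script path taken from the usage lines


def extract_text(pdf_path, max_pages=None):
    """Call the extract command and return whatever it prints to stdout."""
    command = ["python3", READER, "extract", str(pdf_path)]
    if max_pages is not None:
        command += ["--max_pages", str(max_pages)]
    return subprocess.run(command, capture_output=True, text=True).stdout


def index_folder(folder, out_dir, max_pages=None, extract=extract_text):
    """Write one .txt file per *.pdf in folder and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() gives a stable order so repeated runs produce the same listing.
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        text = extract(pdf_path, max_pages)
        target = out_dir / (pdf_path.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

A call such as index_folder("docs", "index", max_pages=10) would leave one text file per PDF in index/. Because the documentation does not say how errors are reported, this sketch stores whatever extract prints, so a real indexer would likely add a check for error output before writing.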