pdf-processing
npx machina-cli add skill aiskillstore/marketplace/pdf-processing --openclaw
Files (1)
SKILL.md
1.2 KB
PDF Processing Skill
This skill provides capabilities for working with PDF documents.
Quick Start
Use pdfplumber to extract text from PDFs:
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
Capabilities
Text Extraction
- Extract text from single or multiple pages
- Preserve layout and formatting
- Handle multi-column documents
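As a minimal sketch of multi-page extraction, the snippet below loops over `pdf.pages` and joins the per-page text. The helper names `join_page_texts` and `extract_all_text` are illustrative and not part of the skill. Pages without a text layer may yield no text, so empty results are treated as empty strings.

```python
def join_page_texts(page_texts, separator="\n\n"):
    """Join per-page text; pages without a text layer count as empty."""
    return separator.join(text or "" for text in page_texts)


def extract_all_text(path):
    """Extract text from every page of the PDF at `path` (hypothetical helper)."""
    # Third-party import kept local so join_page_texts stays dependency-free.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # The generator is consumed inside the `with` block, while the file is open.
        return join_page_texts(page.extract_text() for page in pdf.pages)
```

Calling `extract_all_text("document.pdf")` would return one string with a blank line between pages.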
Table Extraction
- Identify and extract tables
- Convert to structured data (CSV, JSON)
- Handle complex table layouts
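One possible way to turn an extracted table into CSV or JSON is sketched below. It assumes the table arrives as a list of rows (lists of cell strings or `None`), which is the shape pdfplumber's `extract_tables()` returns, and that the first row is a header. The function names are hypothetical.

```python
import csv
import io
import json


def table_to_csv(table):
    """Serialize a list-of-rows table to CSV text; None cells become empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def table_to_json(table):
    """Serialize a table to a JSON array of objects, using row 0 as the header."""
    header, *rows = table
    return json.dumps([dict(zip(header, row)) for row in rows])


def extract_first_page_tables(path):
    """Return all tables found on the first page (hypothetical helper)."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(path) as pdf:
        return pdf.pages[0].extract_tables()
```

Treating the first row as a header is an assumption; tables with merged or repeated header cells would need extra handling before conversion.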
Form Operations
- Fill PDF forms programmatically
- Extract form field values
- Create fillable forms
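The sketch below shows one way to read and fill text fields, assuming PyPDF2 3.x and a PDF that already contains form fields. `normalize_field_values`, `read_form_fields`, and `fill_form` are illustrative names. Creating new fillable forms is not covered here.

```python
def normalize_field_values(values):
    """Drop unset (None) entries and convert the remaining values to strings."""
    return {name: str(value) for name, value in values.items() if value is not None}


def read_form_fields(path):
    """Return {field name: value} for the text fields in the PDF."""
    from PyPDF2 import PdfReader  # third-party; imported lazily

    return PdfReader(path).get_form_text_fields() or {}


def fill_form(src_path, dst_path, values):
    """Copy src_path to dst_path with the given form values filled in."""
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    fields = normalize_field_values(values)
    for page in writer.pages:
        # Only pages that carry annotations can hold form widgets.
        if "/Annots" in page:
            writer.update_page_form_field_values(page, fields)

    with open(dst_path, "wb") as handle:
        writer.write(handle)
```

A call such as `fill_form("form.pdf", "filled.pdf", {"name": "Ada"})` uses placeholder file and field names. The real field names can be listed first with `read_form_fields`.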
Document Operations
- Merge multiple PDFs
- Split PDFs by page
- Rotate pages
- Add watermarks
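Splitting and rotating could look roughly like the following, assuming PyPDF2 3.x. The helper `page_chunks` computes which page indices go into each output file, and `split_pdf` writes them out. Both names, and the `out_pattern` parameter, are assumptions made for this example.

```python
def page_chunks(page_count, chunk_size=1):
    """Return (start, stop) index pairs that cover page_count pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count))
        for start in range(0, page_count, chunk_size)
    ]


def split_pdf(src_path, out_pattern, chunk_size=1, rotate_by=0):
    """Write chunks of src_path to files named out_pattern.format(n)."""
    from PyPDF2 import PdfReader, PdfWriter  # third-party; imported lazily

    if rotate_by % 90 != 0:
        raise ValueError("rotate_by must be a multiple of 90 degrees")

    reader = PdfReader(src_path)
    out_paths = []
    chunks = page_chunks(len(reader.pages), chunk_size)
    for number, (start, stop) in enumerate(chunks, start=1):
        writer = PdfWriter()
        for index in range(start, stop):
            page = reader.pages[index]
            if rotate_by:
                page.rotate(rotate_by)  # clockwise, in 90-degree steps
            writer.add_page(page)
        out_path = out_pattern.format(number)
        with open(out_path, "wb") as handle:
            writer.write(handle)
        out_paths.append(out_path)
    return out_paths
```

For example, `split_pdf("report.pdf", "report-part-{}.pdf", chunk_size=2)` would write two pages per output file. The file names are placeholders.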
Best Practices
- Always check if the PDF is encrypted before processing
- Handle OCR cases for scanned documents
- Validate extracted data for accuracy
- Use appropriate libraries (pdfplumber for extraction, PyPDF2 for manipulation)
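The first two practices can be sketched as pre-flight checks. The code below assumes PyPDF2's `PdfReader.is_encrypted` property and uses a simple heuristic, chosen for this example, that a page with no extractable text is probably a scanned image. All function names are hypothetical.

```python
def needs_ocr(extracted_text, min_chars=1):
    """Heuristic: a page with fewer than min_chars of text is likely a scan."""
    return len((extracted_text or "").strip()) < min_chars


def is_encrypted(path):
    """Report whether the PDF at `path` is encrypted."""
    from PyPDF2 import PdfReader  # third-party; imported lazily

    return PdfReader(path).is_encrypted


def pages_needing_ocr(path):
    """Return 1-based numbers of pages that appear to have no text layer."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(path) as pdf:
        return [
            number
            for number, page in enumerate(pdf.pages, start=1)
            if needs_ocr(page.extract_text())
        ]
```

The `min_chars` threshold is a tunable guess. Pages that contain only a few stray characters around a scanned image may need a higher value.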
Source
git clone https://github.com/aiskillstore/marketplace/blob/main/skills/0xkynz/pdf-processing/SKILL.md
Overview
The pdf-processing skill enables text and table extraction from PDFs, programmatic form filling, and document manipulation such as merging or rotating pages. It is useful when you work with PDFs or PDF forms, or need to convert embedded data into structured formats.
How This Skill Works
This skill relies on libraries such as pdfplumber for text and table extraction and PyPDF2 for document manipulation. It exposes capabilities to read pages, preserve layout, fill forms, merge or split PDFs, and apply simple edits like rotations or watermarks.
When to Use It
- You need to extract text from a PDF for indexing or search.
- You want to convert embedded tables into CSV or JSON for analytics.
- You need to fill PDF forms programmatically or extract form field values.
- You must merge multiple PDFs into a single document, or split, rotate, or watermark existing ones.
- You’re dealing with encrypted PDFs or scanned documents requiring OCR before extraction.
Quick Start
- Step 1: Install libraries: pip install pdfplumber PyPDF2
- Step 2: Use pdfplumber to open a PDF and extract text from the first page
- Step 3: Merge two PDFs using PyPDF2 to create a single document
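Step 3 might be written as follows, assuming PyPDF2 3.x. Pages are copied one by one into a single `PdfWriter`. `append_all_pages` and `merge_pdfs` are illustrative names, and the file names in the usage note are placeholders.

```python
def append_all_pages(writer, readers):
    """Copy every page of each reader into writer, in order; return the count."""
    count = 0
    for reader in readers:
        for page in reader.pages:
            writer.add_page(page)
            count += 1
    return count


def merge_pdfs(paths, out_path):
    """Merge the PDFs at `paths` into one file at `out_path`."""
    from PyPDF2 import PdfReader, PdfWriter  # third-party; imported lazily

    writer = PdfWriter()
    total = append_all_pages(writer, [PdfReader(path) for path in paths])
    with open(out_path, "wb") as handle:
        writer.write(handle)
    return total
```

Usage would be along the lines of `merge_pdfs(["first.pdf", "second.pdf"], "merged.pdf")`, which returns the number of pages written.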
Best Practices
- Always check if the PDF is encrypted before processing
- Handle OCR cases for scanned documents
- Validate extracted data for accuracy
- Use pdfplumber for extraction and PyPDF2 for manipulation
- Test form fills and verify field values after operations
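Verifying a form fill can be as simple as reading the fields back and comparing them with what was intended. The sketch below assumes PyPDF2's `get_form_text_fields()` and that the written file still exposes its form fields. `mismatched_fields` and `verify_filled_form` are hypothetical helpers.

```python
def mismatched_fields(expected, actual):
    """Return {name: (expected, actual)} for every field whose value differs."""
    return {
        name: (value, actual.get(name))
        for name, value in expected.items()
        if actual.get(name) != value
    }


def verify_filled_form(path, expected):
    """Compare the text fields stored in the PDF at `path` with `expected`."""
    from PyPDF2 import PdfReader  # third-party; imported lazily

    actual = PdfReader(path).get_form_text_fields() or {}
    return mismatched_fields(expected, actual)
```

An empty result means every expected field was found with the expected value. Any other result lists the fields to investigate.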
Example Use Cases
- Extract and index text from a multi-page monthly report for a knowledge base
- Identify and export tables from financial PDFs to CSV/JSON for analysis
- Auto-fill a customer intake form and export the filled data to a database
- Merge several quarterly PDFs into a single annual report document
- Add a watermark to branded PDFs and rotate pages for print readiness