Which libraries does this skill rely on?

It uses pdfplumber for extraction and PyPDF2 for manipulation; OCR can be added for scanned PDFs if needed.

Can pdf-processing handle encrypted PDFs?

Yes, but you must detect encryption first and apply decryption steps before processing.

How do I ensure the extracted data is accurate?

Validate extracted data against source material, and use layout-preserving extraction for multi-column documents.
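One way to make the validation step concrete is a structural check on extracted tables before they are used downstream. The sketch below is an assumption about what "validate" could mean in practice, not something the skill itself provides: `check_table` is a hypothetical helper that flags ragged rows and empty cells in the list-of-rows shape that pdfplumber's table extraction returns.

```python
def check_table(rows, expected_columns):
    """Return a list of human-readable problems found in an extracted table.

    `rows` is a list of rows, each a list of cell strings (or None for an
    empty cell). Hypothetical helper, for illustration only.
    """
    if not rows:
        return ["table is empty"]
    problems = []
    for index, row in enumerate(rows):
        # A row with the wrong cell count usually means a merged or split cell
        if len(row) != expected_columns:
            problems.append(
                f"row {index}: expected {expected_columns} cells, got {len(row)}"
            )
        # None or whitespace-only cells are worth a manual look at the source
        if any(cell is None or not str(cell).strip() for cell in row):
            problems.append(f"row {index}: contains an empty cell")
    return problems


sample = [["Item", "Qty"], ["Widget", "3"], ["Gadget", None], ["Orphan"]]
for problem in check_table(sample, expected_columns=2):
    print(problem)
```

A check like this only catches structural damage; comparing a few values by eye against the original PDF is still the final test.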

pdf-processing

npx machina-cli add skill aiskillstore/marketplace/pdf-processing --openclaw

PDF Processing Skill

This skill provides capabilities for working with PDF documents.

Quick Start

Use pdfplumber to extract text from PDFs:

import pdfplumber

# Open the PDF and read the text layer of the first page (index 0)
with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()

Capabilities

Text Extraction

Extract text from single or multiple pages
Preserve layout and formatting
Handle multi-column documents
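The three points above can be sketched as a single function. This is a minimal sketch rather than the skill's own implementation: `column_boxes` and `extract_pages` are hypothetical names, equal-width columns are an assumption that will not suit every layout, and `layout=True` assumes a pdfplumber version that supports that keyword.

```python
def column_boxes(bbox, n_columns):
    """Split a page bounding box (x0, top, x1, bottom) into equal-width columns."""
    x0, top, x1, bottom = bbox
    width = (x1 - x0) / n_columns
    # Use the exact right edge for the last column to avoid float drift
    edges = [x0 + width * i for i in range(n_columns)] + [x1]
    return [(edges[i], top, edges[i + 1], bottom) for i in range(n_columns)]


def extract_pages(path, n_columns=1, layout=False):
    """Return one text string per page, reading columns left to right."""
    import pdfplumber  # third-party; imported lazily so column_boxes works without it

    texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            parts = []
            for box in column_boxes(page.bbox, n_columns):
                # crop() restricts extraction to one column region
                column = page.crop(box)
                # layout=True asks pdfplumber to approximate the visual spacing
                parts.append(column.extract_text(layout=layout) or "")
            texts.append("\n".join(parts))
    return texts
```

Calling `extract_pages("document.pdf", n_columns=2)` would read the left column of each page before the right one, instead of interleaving lines across both.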

Table Extraction

Identify and extract tables
Convert to structured data (CSV, JSON)
Handle complex table layouts
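As a rough illustration of the conversion step, the sketch below turns the list-of-rows structure that pdfplumber's `extract_table()` returns into CSV and JSON. The helper names are hypothetical, and treating the first row as the header is an assumption that should be checked per document.

```python
import csv
import io
import json


def table_to_csv(rows):
    """Serialize extracted rows to CSV text, writing None cells as empty fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def table_to_json(rows):
    """Treat the first row as the header and return a JSON array of objects."""
    header, *body = rows  # assumes at least a header row is present
    return json.dumps([dict(zip(header, row)) for row in body])


def extract_first_table(path, page_number=0):
    """Return the largest table on one page as a list of rows, or None."""
    import pdfplumber  # third-party; imported lazily so the converters stand alone

    with pdfplumber.open(path) as pdf:
        return pdf.pages[page_number].extract_table()
```

Complex layouts (merged cells, tables without ruling lines) may need pdfplumber's table settings tuned before the rows come out clean enough to convert.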

Form Operations

Fill PDF forms programmatically
Extract form field values
Create fillable forms

Document Operations

Merge multiple PDFs
Split PDFs by page
Rotate pages
Add watermarks
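Splitting and rotating can be combined in one pass. The following is a sketch under assumptions, not the skill's implementation: it assumes the PyPDF2 3.x API (`PdfReader`, `PdfWriter`, `page.rotate`), and `page_chunks`, `split_pdf`, and the output naming scheme are hypothetical.

```python
def page_chunks(total_pages, pages_per_file):
    """Return (start, stop) index ranges covering total_pages in fixed-size chunks."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        (start, min(start + pages_per_file, total_pages))
        for start in range(0, total_pages, pages_per_file)
    ]


def split_pdf(path, output_prefix, pages_per_file=1, rotate_degrees=0):
    """Write one PDF per chunk of pages and return the output paths."""
    from PyPDF2 import PdfReader, PdfWriter  # third-party; PyPDF2 3.x assumed

    reader = PdfReader(path)
    outputs = []
    for start, stop in page_chunks(len(reader.pages), pages_per_file):
        writer = PdfWriter()
        for index in range(start, stop):
            page = reader.pages[index]
            if rotate_degrees:
                # rotate() turns the page clockwise; the angle must be a multiple of 90
                page.rotate(rotate_degrees)
            writer.add_page(page)
        # Hypothetical naming scheme: 1-based page range in the file name
        output_path = f"{output_prefix}_{start + 1}-{stop}.pdf"
        with open(output_path, "wb") as handle:
            writer.write(handle)
        outputs.append(output_path)
    return outputs
```

For example, `split_pdf("report.pdf", "report_part", pages_per_file=2)` on a five-page file would produce three files covering pages 1-2, 3-4, and 5.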

Best Practices

Always check if the PDF is encrypted before processing
Handle OCR cases for scanned documents
Validate extracted data for accuracy
Use appropriate libraries (pdfplumber for extraction, PyPDF2 for manipulation)
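The encryption check in the first practice might look like the sketch below. It relies on PyPDF2's `PdfReader.is_encrypted` attribute and `decrypt()` method; the function names `ensure_decrypted` and `open_checked` are hypothetical. The guard is written against any object with those two members, which also makes it easy to exercise without a real PDF.

```python
def ensure_decrypted(reader, password=None):
    """Return the reader once it is readable, or raise ValueError.

    `reader` is expected to behave like PyPDF2.PdfReader: it exposes
    `is_encrypted` and `decrypt(password)`.
    """
    if not reader.is_encrypted:
        return reader
    if password is None:
        raise ValueError("PDF is encrypted and no password was supplied")
    # decrypt() returns a falsy value (0) when the password does not match
    if not reader.decrypt(password):
        raise ValueError("password did not decrypt the PDF")
    return reader


def open_checked(path, password=None):
    """Open a PDF with PyPDF2 and run the encryption check before any other work."""
    from PyPDF2 import PdfReader  # third-party; imported lazily

    return ensure_decrypted(PdfReader(path), password)
```

Failing early with a clear error is the design choice here: later calls such as reading pages tend to raise less obvious exceptions on a locked file.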

Source

https://github.com/aiskillstore/marketplace/blob/main/skills/0xkynz/pdf-processing/SKILL.md

Overview

The pdf-processing skill enables text and table extraction from PDFs, programmatic form filling, and document manipulation such as merging or rotating pages. It is useful whenever you work with PDFs or PDF forms, or need to convert embedded data into structured formats.

How This Skill Works

This skill relies on libraries such as pdfplumber for text and table extraction and PyPDF2 for document manipulation. It exposes capabilities to read pages, preserve layout, fill forms, merge or split PDFs, and apply simple edits like rotations or watermarks.

When to Use It

You need to extract text from a PDF for indexing or search.
You want to convert embedded tables into CSV or JSON for analytics.
You need to fill PDF forms programmatically or extract form field values.
You must merge multiple PDFs into a single document, or split, rotate, or watermark existing ones.
You’re dealing with encrypted PDFs or scanned documents requiring OCR before extraction.
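For the scanned-document case, a first step is deciding which pages need OCR at all. The heuristic below is an assumption, not part of the skill: a page with almost no text layer but at least one embedded image is treated as a likely scan. `looks_scanned`, `pages_needing_ocr`, and the 20-character threshold are hypothetical; the OCR step itself is left out.

```python
def looks_scanned(text, image_count, min_chars=20):
    """Heuristic: little or no extractable text plus at least one image."""
    return len((text or "").strip()) < min_chars and image_count > 0


def pages_needing_ocr(path, min_chars=20):
    """Return 1-based page numbers that probably need OCR before extraction."""
    import pdfplumber  # third-party; imported lazily so looks_scanned stands alone

    flagged = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            # page.images lists the image objects pdfplumber found on the page
            if looks_scanned(page.extract_text(), len(page.images), min_chars):
                flagged.append(number)
    return flagged
```

Pages that are flagged can then be routed to an OCR tool, while the rest go through normal text extraction.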

Quick Start

Step 1: Install libraries: pip install pdfplumber PyPDF2
Step 2: Use pdfplumber to open a PDF and extract text from the first page
Step 3: Merge two PDFs using PyPDF2 to create a single document
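Step 3 could be written as follows once the libraries from Step 1 are installed. This is a minimal sketch assuming the PyPDF2 3.x `PdfReader` and `PdfWriter` classes; `merge_pdfs` and the file names in the usage note are hypothetical.

```python
def merge_pdfs(input_paths, output_path):
    """Append every page of each input PDF, in order, and write one output file.

    Returns the number of pages written.
    """
    from PyPDF2 import PdfReader, PdfWriter  # third-party; PyPDF2 3.x assumed

    writer = PdfWriter()
    page_count = 0
    for path in input_paths:
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)
            page_count += 1
    with open(output_path, "wb") as handle:
        writer.write(handle)
    return page_count
```

A call such as `merge_pdfs(["first.pdf", "second.pdf"], "combined.pdf")` keeps the page order of the inputs as listed.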

Best Practices

Always check if the PDF is encrypted before processing
Handle OCR cases for scanned documents
Validate extracted data for accuracy
Use pdfplumber for extraction and PyPDF2 for manipulation
Test form fills and verify field values after operations
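The last practice, verifying field values after a fill, can be sketched as a comparison between what was intended and what the saved file reports. The helper names are hypothetical, and `read_text_fields` assumes PyPDF2's `PdfReader.get_form_text_fields()`, which covers text fields only.

```python
def mismatched_fields(expected, actual):
    """Return {name: (expected_value, actual_value)} for fields that differ."""
    return {
        name: (value, actual.get(name))
        for name, value in expected.items()
        if actual.get(name) != value
    }


def read_text_fields(path):
    """Read the text form fields of a PDF as a {name: value} dict."""
    from PyPDF2 import PdfReader  # third-party; imported lazily

    return PdfReader(path).get_form_text_fields()
```

After filling a form, `mismatched_fields(intended_values, read_text_fields("filled.pdf"))` should come back empty; any entry it returns names a field that did not take the intended value.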

Example Use Cases

Extract and index text from a multi-page monthly report for a knowledge base
Identify and export tables from financial PDFs to CSV/JSON for analysis
Auto-fill a customer intake form and export the filled data to a database
Merge several quarterly PDFs into a single annual report document
Add a watermark to branded PDFs and rotate pages for print readiness
