
pdf-manipulation

npx machina-cli add skill besoeasy/open-skills/pdf-manipulation --openclaw

PDF Manipulation Skill

Merge, split, extract, redact, and transform PDF files using free command-line tools and libraries. Covers common PDF operations for document automation workflows.

When to use

  • Merge multiple PDFs into one document
  • Split large PDFs into separate files or page ranges
  • Extract text, images, or specific pages
  • Redact sensitive information
  • Add watermarks, passwords, or metadata
  • Convert PDFs to images or other formats

Required tools

  • pdftk — Swiss Army knife for PDF manipulation (merge, split, rotate, encrypt)
  • qpdf — PDF transformation and encryption (linearize, decrypt, repair)
  • pdftotext / pdfimages — Part of poppler-utils (extract text and images)
  • ghostscript (gs) — Advanced PDF processing, compression, and conversion

Installation

# Ubuntu/Debian
sudo apt-get install pdftk qpdf poppler-utils ghostscript

# macOS (Homebrew)
brew install pdftk-java qpdf poppler ghostscript

# For Node.js: npm i pdf-lib pdf-parse (pure JS, no system deps)
# For Python: pip install pypdf (the successor to PyPDF2, which is no longer maintained)

Skills

Merge PDFs

# Using pdftk (simple and fast; bookmarks from the input files may not carry over, so check the output)
pdftk file1.pdf file2.pdf file3.pdf cat output merged.pdf

# Using ghostscript (better compression)
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=merged.pdf file1.pdf file2.pdf file3.pdf

# Using qpdf (preserves structure)
qpdf --empty --pages file1.pdf file2.pdf file3.pdf -- merged.pdf

Node.js (pdf-lib):

const { PDFDocument } = require('pdf-lib');
const fs = require('fs');

async function mergePDFs(files, output) {
  const mergedPdf = await PDFDocument.create();
  
  for (const file of files) {
    const pdfBytes = fs.readFileSync(file);
    const pdf = await PDFDocument.load(pdfBytes);
    const pages = await mergedPdf.copyPages(pdf, pdf.getPageIndices());
    pages.forEach(page => mergedPdf.addPage(page));
  }
  
  const mergedBytes = await mergedPdf.save();
  fs.writeFileSync(output, mergedBytes);
}

// mergePDFs(['file1.pdf', 'file2.pdf'], 'merged.pdf');

Split PDF (by page or range)

# Split every page into separate files
pdftk input.pdf burst output page_%02d.pdf

# Extract specific pages (e.g., pages 1-5 and 10)
pdftk input.pdf cat 1-5 10 output subset.pdf

# Extract page ranges with qpdf
qpdf input.pdf --pages . 1-5 -- output.pdf

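# Split every N pages (e.g., every 2 pages)
# pdftk burst only produces single pages, so use qpdf here;
# %d in the output name is replaced with the page range
qpdf --split-pages=2 input.pdf part_%d.pdf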

Node.js (pdf-lib):

const { PDFDocument } = require('pdf-lib');
const fs = require('fs');

async function extractPages(inputPath, pages, outputPath) {
  const pdfBytes = fs.readFileSync(inputPath);
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const newPdf = await PDFDocument.create();
  
  for (const pageNum of pages) {
    const [page] = await newPdf.copyPages(pdfDoc, [pageNum - 1]);
    newPdf.addPage(page);
  }
  
  const newBytes = await newPdf.save();
  fs.writeFileSync(outputPath, newBytes);
}

// extractPages('input.pdf', [1, 3, 5], 'output.pdf');
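
For the "split every N pages" case mentioned above, one possible approach in Node.js is to compute the page groups first and then copy each group into its own document. This is a minimal sketch that assumes pdf-lib is installed; `chunkIndices` and `splitEveryN` are hypothetical helper names, not part of pdf-lib, and the output naming scheme is an arbitrary choice.

```javascript
const fs = require('fs');

// Group zero-based page indices into chunks of `size` pages.
// The last chunk may be shorter than `size`.
function chunkIndices(pageCount, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const chunks = [];
  for (let start = 0; start < pageCount; start += size) {
    const end = Math.min(start + size, pageCount);
    chunks.push(Array.from({ length: end - start }, (_, i) => start + i));
  }
  return chunks;
}

// Write one PDF per chunk: <prefix>_1.pdf, <prefix>_2.pdf, ...
async function splitEveryN(inputPath, size, prefix) {
  const { PDFDocument } = require('pdf-lib'); // loaded lazily; assumed installed
  const source = await PDFDocument.load(fs.readFileSync(inputPath));
  const chunks = chunkIndices(source.getPageCount(), size);
  const outputs = [];
  for (let i = 0; i < chunks.length; i++) {
    const part = await PDFDocument.create();
    const pages = await part.copyPages(source, chunks[i]);
    pages.forEach(page => part.addPage(page));
    const outPath = `${prefix}_${i + 1}.pdf`;
    fs.writeFileSync(outPath, await part.save());
    outputs.push(outPath);
  }
  return outputs;
}

// splitEveryN('input.pdf', 2, 'part');
```

Keeping the index arithmetic in its own function makes it easy to check the grouping without touching any PDF files.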

Extract text

# Extract all text (reading order; add -layout to preserve the page layout)
pdftotext input.pdf output.txt

# Extract text in content-stream order (no layout reconstruction)
pdftotext -raw input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt

# Preserve the physical layout and write to stdout
pdftotext -layout input.pdf -

Node.js (pdf-parse):

const fs = require('fs');
const pdf = require('pdf-parse');

async function extractText(filePath) {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdf(dataBuffer);
  return data.text;
}

// extractText('input.pdf').then(console.log);
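
As an alternative to pdf-parse, the pdftotext options shown above can be driven from Node.js. The sketch below assumes pdftotext (poppler-utils) is on the PATH; `pdftotextArgs` and `extractTextCli` are hypothetical helper names introduced for illustration.

```javascript
const { execFileSync } = require('child_process');

// Build the argument list for pdftotext from a small options object.
function pdftotextArgs(inputPath, { first, last, layout = false } = {}) {
  const args = [];
  if (first !== undefined) args.push('-f', String(first)); // first page (1-based)
  if (last !== undefined) args.push('-l', String(last));   // last page (1-based)
  if (layout) args.push('-layout');                         // keep physical layout
  args.push(inputPath, '-');                                // "-" writes to stdout
  return args;
}

// Run pdftotext and return the extracted text as a string.
function extractTextCli(inputPath, options) {
  return execFileSync('pdftotext', pdftotextArgs(inputPath, options), {
    encoding: 'utf8',
  });
}

// extractTextCli('input.pdf', { first: 1, last: 5, layout: true });
```

Passing arguments as an array to `execFileSync` avoids shell quoting problems with file names that contain spaces.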

Extract images

# Extract all images from PDF
pdfimages -all input.pdf output_prefix

# Output: output_prefix-000.png, output_prefix-001.jpg, etc.

# Extract only JPEGs
pdfimages -j input.pdf output_prefix

Redact / Remove pages

# Remove specific pages (e.g., remove pages 2-4)
pdftk input.pdf cat 1 5-end output redacted.pdf

# Keep only specific pages
pdftk input.pdf cat 1-10 20-30 output selected.pdf
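
When the pages to drop are known but the pages to keep are not, the `cat` ranges can be derived programmatically. The following is a sketch only; `keepRanges` and `removePagesArgs` are hypothetical helpers, and the page count is assumed to be known already (for example from pdfinfo). Note that this drops whole pages; it does not remove text within a page.

```javascript
// Turn a list of 1-based pages to remove into pdftk "cat" ranges to keep.
function keepRanges(pageCount, pagesToRemove) {
  const remove = new Set(pagesToRemove);
  const ranges = [];
  let start = null; // first page of the run currently being kept
  for (let page = 1; page <= pageCount; page++) {
    if (!remove.has(page)) {
      if (start === null) start = page;
    } else if (start !== null) {
      ranges.push(start === page - 1 ? `${start}` : `${start}-${page - 1}`);
      start = null;
    }
  }
  if (start !== null) {
    ranges.push(start === pageCount ? `${start}` : `${start}-${pageCount}`);
  }
  return ranges;
}

// Full pdftk argument list: input, "cat", ranges, "output", target.
function removePagesArgs(inputPath, pageCount, pagesToRemove, outputPath) {
  const ranges = keepRanges(pageCount, pagesToRemove);
  if (ranges.length === 0) throw new Error('no pages left to keep');
  return [inputPath, 'cat', ...ranges, 'output', outputPath];
}

// removePagesArgs('input.pdf', 10, [2, 3, 4], 'redacted.pdf');
```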

Add password protection

# Encrypt PDF with password
pdftk input.pdf output secured.pdf user_pw mypassword

# Remove password
pdftk secured.pdf input_pw mypassword output unlocked.pdf

# Using qpdf (AES-256)
qpdf --encrypt userpass ownerpass 256 -- input.pdf output.pdf

Node.js (calling qpdf; pdf-lib itself cannot encrypt):

const { execFileSync } = require('child_process');

// pdf-lib's save() has no password options, so this shells out to qpdf.
// Requires qpdf on the PATH. Uses the same password for user and owner,
// with 256-bit AES.
function encryptPDF(inputPath, password, outputPath) {
  execFileSync('qpdf', [
    '--encrypt', password, password, '256', '--',
    inputPath, outputPath
  ]);
}

// encryptPDF('input.pdf', 'mypassword', 'secured.pdf');

Rotate pages

# Rotate all pages 90 degrees clockwise
pdftk input.pdf cat 1-endright output rotated.pdf

# Rotate specific pages
pdftk input.pdf cat 1-5 6right 7-end output rotated.pdf

# Options: right (90°), left (270°), down (180°)
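
The same rotation keywords can be mirrored in Node.js. This sketch assumes pdf-lib is installed and uses its `degrees`, `getRotation`, and `setRotation` calls; `addRotation` and `rotatePages` are hypothetical helper names. Like pdftk's `right`, `left`, and `down`, the rotation is applied relative to each page's current angle.

```javascript
const fs = require('fs');

// Degrees for the pdftk rotation keywords listed above.
const ROTATION = new Map([['right', 90], ['down', 180], ['left', 270]]);

// Add a rotation to a page's current angle, normalized to 0, 90, 180 or 270.
function addRotation(currentAngle, keyword) {
  if (!ROTATION.has(keyword)) throw new Error(`unknown rotation: ${keyword}`);
  return (((currentAngle + ROTATION.get(keyword)) % 360) + 360) % 360;
}

// Rotate every page of a PDF and write the result to outputPath.
async function rotatePages(inputPath, keyword, outputPath) {
  const { PDFDocument, degrees } = require('pdf-lib'); // assumed installed
  const pdfDoc = await PDFDocument.load(fs.readFileSync(inputPath));
  for (const page of pdfDoc.getPages()) {
    const angle = addRotation(page.getRotation().angle, keyword);
    page.setRotation(degrees(angle));
  }
  fs.writeFileSync(outputPath, await pdfDoc.save());
}

// rotatePages('input.pdf', 'right', 'rotated.pdf');
```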

Compress / Reduce file size

# Using ghostscript (adjust quality)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dQUIET -dBATCH -sOutputFile=compressed.pdf input.pdf

# Quality settings:
#   /screen   - low quality (72 dpi)
#   /ebook    - medium (150 dpi)
#   /printer  - high (300 dpi)
#   /prepress - highest (300 dpi, preserves color)

# Using qpdf (lossless compression)
qpdf --linearize --object-streams=generate input.pdf compressed.pdf
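
To script the Ghostscript command above, the quality preset can be turned into a parameter. This is a sketch assuming `gs` is on the PATH; `gsCompressArgs` and `compressPdf` are hypothetical names. Reporting both sizes is a deliberate choice here, since recompression does not always shrink a file.

```javascript
const fs = require('fs');
const { execFileSync } = require('child_process');

// The -dPDFSETTINGS presets listed above.
const PRESETS = ['screen', 'ebook', 'printer', 'prepress'];

// Build the Ghostscript argument list for a given quality preset.
function gsCompressArgs(inputPath, outputPath, preset = 'ebook') {
  if (!PRESETS.includes(preset)) {
    throw new Error(`unknown preset: ${preset}`);
  }
  return [
    '-sDEVICE=pdfwrite',
    '-dCompatibilityLevel=1.4',
    `-dPDFSETTINGS=/${preset}`,
    '-dNOPAUSE', '-dQUIET', '-dBATCH',
    `-sOutputFile=${outputPath}`,
    inputPath,
  ];
}

// Compress and return the file sizes so the caller can judge the result.
function compressPdf(inputPath, outputPath, preset) {
  execFileSync('gs', gsCompressArgs(inputPath, outputPath, preset));
  const before = fs.statSync(inputPath).size;
  const after = fs.statSync(outputPath).size;
  return { before, after, ratio: after / before };
}

// compressPdf('input.pdf', 'compressed.pdf', 'ebook');
```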

Convert PDF to images

# Convert each page to PNG (300 DPI)
pdftoppm -png -r 300 input.pdf output_prefix

# Output: output_prefix-1.png, output_prefix-2.png, etc.

# Convert to JPEG
pdftoppm -jpeg -r 150 input.pdf output_prefix

# Using ImageMagick (alternative)
# (in ImageMagick 7 the command is `magick` rather than `convert`)
convert -density 300 input.pdf output_%03d.png

Add watermark

# Overlay watermark.pdf on every page
pdftk input.pdf stamp watermark.pdf output watermarked.pdf

# Background watermark (behind content)
pdftk input.pdf background watermark.pdf output watermarked.pdf

# Multi-page watermark: page N of watermark.pdf is stamped onto page N of input.pdf
pdftk input.pdf multistamp watermark.pdf output watermarked.pdf

Get PDF metadata

# Using pdftk
pdftk input.pdf dump_data

# Using qpdf (page count only; qpdf --json input.pdf dumps the full document structure)
qpdf --show-npages input.pdf

# Using pdfinfo (poppler-utils)
pdfinfo input.pdf
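
pdfinfo prints one `Key: value` pair per line, which makes it straightforward to turn into an object. The sketch below assumes that output shape and that pdfinfo is on the PATH; `parsePdfinfo` and `getPdfInfo` are hypothetical helper names.

```javascript
const { execFileSync } = require('child_process');

// Parse "Key:   value" lines into an object; split on the first colon only
// so values that contain colons (such as dates) stay intact.
function parsePdfinfo(output) {
  const info = {};
  for (const line of output.split(/\r?\n/)) {
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim();
    const value = line.slice(sep + 1).trim();
    if (key) info[key] = value;
  }
  if (info.Pages !== undefined) info.Pages = Number(info.Pages);
  return info;
}

// Run pdfinfo on a file and return the parsed fields.
function getPdfInfo(inputPath) {
  return parsePdfinfo(execFileSync('pdfinfo', [inputPath], { encoding: 'utf8' }));
}

// getPdfInfo('input.pdf').Pages
```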

Multi-operation script (Node.js)

const { PDFDocument } = require('pdf-lib');
const fs = require('fs');

class PDFHelper {
  static async merge(files, output) {
    const merged = await PDFDocument.create();
    for (const file of files) {
      const pdf = await PDFDocument.load(fs.readFileSync(file));
      const pages = await merged.copyPages(pdf, pdf.getPageIndices());
      pages.forEach(p => merged.addPage(p));
    }
    fs.writeFileSync(output, await merged.save());
  }

  // `ranges` is an array of zero-based page indices, e.g. [0, 1, 4]
  static async split(input, ranges, output) {
    const pdf = await PDFDocument.load(fs.readFileSync(input));
    const newPdf = await PDFDocument.create();
    const pages = await newPdf.copyPages(pdf, ranges);
    pages.forEach(p => newPdf.addPage(p));
    fs.writeFileSync(output, await newPdf.save());
  }

  static async info(input) {
    const pdf = await PDFDocument.load(fs.readFileSync(input));
    return {
      pages: pdf.getPageCount(),
      title: pdf.getTitle(),
      author: pdf.getAuthor(),
      creator: pdf.getCreator()
    };
  }
}

module.exports = PDFHelper;

Agent prompt

You have PDF manipulation skills. When a user requests PDF operations:

1. Detect the operation: merge, split, extract (text/images/pages), redact, compress, encrypt, rotate, watermark, or get info.
2. Use appropriate tools:
   - pdftk for merge, split, rotate, encrypt, watermark
   - pdftotext/pdfimages for extraction
   - ghostscript for compression
   - qpdf for repair and advanced operations
3. Always validate input files exist before processing.
4. For scripting, prefer pdf-lib (Node.js) or pypdf (Python, the successor to PyPDF2) for portability.
5. Return structured output (file paths, metadata, text) in JSON format.
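
Steps 1 to 3 and 5 above could be sketched as a small planning function that picks a tool, checks that the inputs exist, and returns JSON. The operation labels and the `planOperation` name are hypothetical choices for this example; nothing is executed.

```javascript
const fs = require('fs');

// Preferred tool per operation, following the list above.
// The operation names are hypothetical labels chosen for this sketch.
const TOOL_FOR = new Map([
  ['merge', 'pdftk'], ['split', 'pdftk'], ['rotate', 'pdftk'],
  ['encrypt', 'pdftk'], ['watermark', 'pdftk'],
  ['extract-text', 'pdftotext'], ['extract-images', 'pdfimages'],
  ['compress', 'gs'], ['repair', 'qpdf'],
]);

// Validate inputs and return a JSON plan instead of running anything.
function planOperation(operation, inputFiles) {
  const tool = TOOL_FOR.get(operation) ?? null;
  const missing = inputFiles.filter(file => !fs.existsSync(file));
  return JSON.stringify({
    ok: tool !== null && missing.length === 0,
    operation,
    tool,
    inputs: inputFiles,
    missing,
  });
}

// planOperation('merge', ['file1.pdf', 'file2.pdf']);
```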

Best practices

  • Validate PDFs before processing (use qpdf --check input.pdf).
  • Preserve metadata when possible (use pdftk or pdf-lib, avoid ghostscript for simple operations).
  • Use appropriate compression: Ghostscript /ebook is a good balance for most cases.
  • Security: If a PDF is encrypted, decrypt it with the password the user provides before processing; never log or echo passwords.
  • Large files: For 100+ page PDFs, process page ranges in chunks or use streaming APIs where the library offers them.
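
Two of these practices, validating with `qpdf --check` and keeping passwords out of logs, could look roughly like the sketch below. It assumes qpdf is on the PATH and relies on qpdf's documented exit codes (0 for success, 3 for warnings only, 2 for errors); the helper names are hypothetical.

```javascript
const { spawnSync } = require('child_process');

// Map a qpdf exit code to a short status label.
function describeQpdfStatus(code) {
  if (code === 0) return 'ok';
  if (code === 3) return 'warnings';
  return 'errors';
}

// Run qpdf --check and return the status plus qpdf's own messages.
function checkPdf(inputPath) {
  const result = spawnSync('qpdf', ['--check', inputPath], { encoding: 'utf8' });
  if (result.error) throw result.error; // e.g. qpdf is not installed
  return {
    status: describeQpdfStatus(result.status),
    details: result.stdout + result.stderr,
  };
}

// Replace password values before an argument list is written to a log.
// Covers pdftk's *_pw keywords and qpdf's --password= option.
function maskPasswords(args) {
  const keywords = new Set(['user_pw', 'owner_pw', 'input_pw']);
  return args.map((arg, i) => {
    if (i > 0 && keywords.has(args[i - 1])) return '***';
    if (arg.startsWith('--password=')) return '--password=***';
    return arg;
  });
}

// console.log(maskPasswords(['secured.pdf', 'input_pw', 'secret', 'output', 'out.pdf']));
```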

Common workflows

Invoice processing

# 1. Extract text for parsing
pdftotext invoice.pdf invoice.txt

# 2. Extract first page only (summary)
pdftk invoice.pdf cat 1 output summary.pdf

# 3. Compress for archival
gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -dBATCH -dNOPAUSE -q \
   -sOutputFile=invoice_compressed.pdf invoice.pdf

Batch processing

# Merge all PDFs in a directory
pdftk *.pdf cat output combined.pdf

# Split each PDF in directory into individual pages
for f in *.pdf; do
  pdftk "$f" burst output "${f%.pdf}_page_%02d.pdf"
done

# Extract text from all PDFs
for f in *.pdf; do
  pdftotext "$f" "${f%.pdf}.txt"
done
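
One thing to watch with the `*.pdf` merge above: the shell expands the glob before pdftk runs, so a `combined.pdf` left over from an earlier run is picked up as an input. A possible Node.js sketch that excludes the output file is shown here; `listInputPdfs` and `mergeDirArgs` are hypothetical helper names, and sorting by name is an assumption about the desired page order.

```javascript
const fs = require('fs');
const path = require('path');

// List the PDFs in `dir` in sorted order, skipping the merge target itself.
function listInputPdfs(dir, outputName) {
  return fs.readdirSync(dir)
    .filter(name => name.toLowerCase().endsWith('.pdf') && name !== outputName)
    .sort()
    .map(name => path.join(dir, name));
}

// pdftk argument list for merging everything in `dir` into `outputName`.
function mergeDirArgs(dir, outputName) {
  const inputs = listInputPdfs(dir, outputName);
  if (inputs.length === 0) throw new Error(`no PDFs found in ${dir}`);
  return [...inputs, 'cat', 'output', path.join(dir, outputName)];
}

// require('child_process').execFileSync('pdftk', mergeDirArgs('.', 'combined.pdf'));
```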

Troubleshooting

  • Corrupted PDF: Use qpdf --check then qpdf input.pdf --replace-input to repair.
  • Encrypted PDF: Remove password first with qpdf --decrypt --password=PASS input.pdf output.pdf.
  • Large file size: Use ghostscript compression or remove embedded fonts/images if not needed.
  • Missing fonts: Install the fonts-liberation or ttf-mscorefonts-installer package (Debian/Ubuntu).

Source

https://github.com/besoeasy/open-skills/blob/main/skills/pdf-manipulation/SKILL.md

Overview

Merge, split, extract, redact, and transform PDFs using free command-line tools and libraries. This skill covers common document automation workflows and supports both CLI workflows and code libraries (Node.js, Python).

How This Skill Works

Use CLI tools such as pdftk, qpdf, pdftotext, pdfimages, and Ghostscript for file-level operations, or code libraries such as pdf-lib (Node.js) or pypdf (Python) to merge, split, and extract pages programmatically. The skill provides practical commands and code samples for common tasks, with notes on bookmarks, metadata, and compression trade-offs.

Quick Start

  1. Install the tools (pdftk, qpdf, poppler-utils, Ghostscript).
  2. Choose an operation (e.g., merge: pdftk a.pdf b.pdf cat output merged.pdf).
  3. Run the command and verify the output (check page count, metadata, and bookmarks).

Best Practices

  • Choose the right tool for the task (pdftk for merging, qpdf for preserving structure, Ghostscript for compression).
  • Bookmarks and form fields do not always survive a merge; test each tool on a sample file and inspect the output before committing to a workflow.
  • Validate outputs by checking page counts, metadata, and accessibility of bookmarks after each operation.
  • Automate repetitive tasks with lightweight Node.js or Python wrappers using pdf-lib or pypdf.
  • Redact carefully: remove the content and sanitize metadata to prevent recovery.

Example Use Cases

  • Merge client reports into a single PDF for distribution.
  • Split a large scanned report into per-client documents.
  • Extract text for indexing and search optimization.
  • Redact personally identifiable information before sharing with partners.
  • Convert PDFs to images for web thumbnails or slide decks.
