
document-converter

npx machina-cli add skill pablodiegoo/Data-Pro-Skill/document-converter --openclaw

Document Converter

Skill for importing external documents (PDF/DOCX/PPTX) to Markdown and exporting analysis results to professional reports (PDF/DOCX).

1. IMPORT: External Docs → Markdown

Uses markdowner.py with optional OCR fallback.

python3 .agent/skills/document-converter/scripts/markdowner.py input.pdf [--ocr]
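When the import step is driven from another script, the command line above can be assembled programmatically. The following is a minimal sketch: `import_command` and the `SUPPORTED` set are hypothetical names introduced here, and only the script path, the positional input file, and the `--ocr` flag come from this page.

```python
from pathlib import Path

# Script path exactly as shown in the command above.
MARKDOWNER = ".agent/skills/document-converter/scripts/markdowner.py"
# Input types this skill lists: PDF, DOCX, PPTX.
SUPPORTED = {".pdf", ".docx", ".pptx"}


def import_command(source: str, ocr: bool = False) -> list[str]:
    """Build the argv for one markdowner.py run (hypothetical helper)."""
    suffix = Path(source).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported input type: {suffix or source}")
    cmd = ["python3", MARKDOWNER, source]
    if ocr:
        # --ocr is the only optional flag documented on this page.
        cmd.append("--ocr")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(import_command("input.pdf", ocr=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the skill is installed at that path.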

2. EXPORT: Markdown → Final Report

Uses compile_report.py for standard reports or Quarto for premium reports.

# Standard PDF
python3 .agent/skills/document-converter/scripts/compile_report.py report.md --format pdf
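The export call can be wrapped the same way. This is a sketch under assumptions: `export_command` is a hypothetical helper, and `docx` as a value for `--format` is inferred from the PDF/DOCX export described on this page rather than confirmed by it; only `--format pdf` appears in the documented command.

```python
# Script path exactly as shown in the command above.
COMPILE_REPORT = ".agent/skills/document-converter/scripts/compile_report.py"
# "pdf" is documented; "docx" is an assumption based on the skill description.
FORMATS = ("pdf", "docx")


def export_command(markdown_path: str, fmt: str = "pdf") -> list[str]:
    """Build the argv for one compile_report.py run (hypothetical helper)."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if not markdown_path.lower().endswith(".md"):
        raise ValueError(f"expected a Markdown file, got: {markdown_path}")
    return ["python3", COMPILE_REPORT, markdown_path, "--format", fmt]


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(export_command("report.md")))
```

Validating the format before launching the script keeps a typo from surfacing only after a slow LaTeX run.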

Detailed Guides & Reference

Assets

  • Quarto Templates: See assets/quarto-templates/ for base structure.

Dependencies

System Packages

sudo apt install poppler-utils tesseract-ocr pandoc texlive-xetex texlive-fonts-extra

Python Packages

pip install pypandoc pdfminer.six pdf2image pytesseract python-pptx Pillow
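Before running a conversion, it can help to confirm that these packages are actually present. The sketch below uses only the Python standard library; the executable names (`pdftoppm`, `tesseract`, `pandoc`, `xelatex`) and import names (`pdfminer`, `pptx`, `PIL`) are the usual ones for the packages listed above, and `missing_dependencies` is a hypothetical helper, not part of the skill. `texlive-fonts-extra` ships fonts rather than a command, so it is not checked.

```python
import importlib.util
import shutil

# Executable to look for -> apt package that provides it.
SYSTEM_TOOLS = {
    "pdftoppm": "poppler-utils",
    "tesseract": "tesseract-ocr",
    "pandoc": "pandoc",
    "xelatex": "texlive-xetex",
}

# Import name -> pip package that provides it.
PYTHON_MODULES = {
    "pypandoc": "pypandoc",
    "pdfminer": "pdfminer.six",
    "pdf2image": "pdf2image",
    "pytesseract": "pytesseract",
    "pptx": "python-pptx",
    "PIL": "Pillow",
}


def missing_dependencies() -> tuple[list[str], list[str]]:
    """Return (missing apt packages, missing pip packages)."""
    tools = [pkg for exe, pkg in SYSTEM_TOOLS.items() if shutil.which(exe) is None]
    modules = [
        pkg
        for mod, pkg in PYTHON_MODULES.items()
        if importlib.util.find_spec(mod) is None
    ]
    return tools, modules


if __name__ == "__main__":
    apt_missing, pip_missing = missing_dependencies()
    if apt_missing:
        print("sudo apt install " + " ".join(apt_missing))
    if pip_missing:
        print("pip install " + " ".join(pip_missing))
```

Run as a script, it prints install commands only for whatever is missing and prints nothing when everything is in place.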

File Structure

.agent/skills/document-converter/
├── SKILL.md
├── assets/          # Templates and branding
├── references/      # Report manuals
│   ├── quarto_reports.md
│   └── troubleshooting.md
└── scripts/
    ├── markdowner.py      # Import engine
    └── compile_report.py  # Export engine

Source

https://github.com/pablodiegoo/Data-Pro-Skill/blob/main/src/datapro/data/skills/document-converter/SKILL.md

Overview

Converts PDFs, DOCX, and PPTX into Markdown, with OCR fallback for image-based pages. It can export Markdown as professional PDFs or DOCX reports, using standard templates or premium Quarto reports with cover pages and branding.

How This Skill Works

It uses markdowner.py to extract content into Markdown (with an optional --ocr flag for scanned documents). For exporting, it leverages compile_report.py for standard PDFs or a Quarto-based pipeline for premium reports, applying templates and branding during rendering.
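This page does not say how markdowner.py decides that a page needs OCR. One common heuristic, shown here purely as an illustration and not as the script's actual logic, is to flag pages whose extracted text is nearly empty. The function name `pages_needing_ocr` and the `min_chars` threshold are assumptions.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 0-based indexes of pages whose extracted text is too sparse.

    Illustrative heuristic only: a page counts as image-based when it holds
    fewer than `min_chars` non-whitespace characters.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        # Ignore whitespace so blank lines and padding do not count as content.
        visible = "".join(text.split())
        if len(visible) < min_chars:
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    pages = ["A normal page with plenty of extracted text.", "", "  \n "]
    print(pages_needing_ocr(pages))  # prints [1, 2]
```

A threshold like this trades accuracy for speed: pages that are mostly images but carry a short caption may slip through, so the limit would need tuning per document set.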

When to Use It

  • Convert external PDFs, DOCX, or PPTX into clean Markdown for analysis or content reuse
  • Generate a professional PDF or DOCX report from a Markdown analysis result
  • Create branded reports with cover pages and themes via standard templates or premium Quarto reports
  • Process scanned documents that require OCR to extract text
  • Reuse Markdown content across formats for multiple projects

Quick Start

  1. Import a document: python3 .agent/skills/document-converter/scripts/markdowner.py input.pdf [--ocr]
  2. Edit the resulting Markdown as needed
  3. Export: python3 .agent/skills/document-converter/scripts/compile_report.py report.md --format pdf

Best Practices

  • Use OCR when dealing with image-based or scanned documents to improve text extraction
  • Start from a clean Markdown source and keep metadata consistent for easier exports
  • Use compile_report.py for standard PDFs and the Quarto templates for branded premium reports
  • Keep branding assets up to date in assets/quarto-templates/ for consistent visuals
  • Verify dependencies (system packages and Python libraries) before running conversions

Example Use Cases

  • Convert a client brochure from PDF to Markdown for content extraction and republishing
  • Generate a formal project report PDF from a Markdown analysis result
  • Produce a premium branded Quarto report with a custom cover page from Markdown
  • Turn a PPTX slide deck into Markdown to repurpose content into a report
  • Create a branded report from Markdown using templates and a cover page
