
Docling Document Parser

Docling is a document parsing library that converts PDFs, Word documents, PowerPoint, images, and other formats into structured data with advanced layout understanding.

Quick Start

Basic document conversion:

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # URL, Path, or BytesIO
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

Core Concepts

DocumentConverter

The main entry point for document conversion. Supports various input formats and conversion options.

from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Basic converter (all formats enabled)
converter = DocumentConverter()

# Restricted formats
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX]
)

# Custom pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

ConversionResult

All conversion operations return a ConversionResult containing:

  • document: The parsed DoclingDocument
  • status: ConversionStatus.SUCCESS, PARTIAL_SUCCESS, or FAILURE
  • errors: List of errors encountered during conversion
  • input: Information about the source document

from docling.datamodel.base_models import ConversionStatus

result = converter.convert("document.pdf")

if result.status == ConversionStatus.SUCCESS:
    markdown = result.document.export_to_markdown()
    html = result.document.export_to_html()
    data = result.document.export_to_dict()

Supported Formats

Input Formats

  • Documents: PDF, DOCX, PPTX, XLSX
  • Markup: HTML, Markdown, AsciiDoc
  • Data: CSV, JSON (Docling format)
  • Images: PNG, JPEG, TIFF, BMP, WEBP
  • Audio: WAV, MP3
  • Video Text: WebVTT
  • Schema-specific: USPTO XML, JATS XML, METS-GBS

Output Formats

  • Markdown: export_to_markdown() or save_as_markdown()
  • HTML: export_to_html() or save_as_html()
  • JSON: export_to_dict() or save_as_json() (note: no export_to_json() method)
  • Text: export_to_text() or export_to_markdown(strict_text=True) or save_as_markdown(strict_text=True)
  • DocTags: export_to_doctags() or save_as_doctags()

Common Patterns

Single File Conversion

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to different formats
markdown = result.document.export_to_markdown()
html = result.document.export_to_html()
json_data = result.document.export_to_dict()

# Or save directly to file
result.document.save_as_markdown("output.md")
result.document.save_as_html("output.html")
result.document.save_as_json("output.json")

Batch Processing

See references/batch.md for details on convert_all().
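As a minimal sketch of what that looks like in practice (the helper name and output layout here are illustrative, not part of Docling's API), convert_all() with raises_on_error=False can drive a whole batch:

```python
from pathlib import Path


def markdown_output_path(source, out_dir="out"):
    """Map an input document path to its Markdown output path (illustrative helper)."""
    return Path(out_dir) / (Path(source).stem + ".md")


def convert_folder(paths, out_dir="out"):
    """Convert a batch of documents to Markdown, skipping failures (sketch)."""
    # Docling imports are deferred so the helper above stays importable on its own.
    from docling.datamodel.base_models import ConversionStatus
    from docling.document_converter import DocumentConverter

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    converter = DocumentConverter()
    # raises_on_error=False keeps the batch going past individual failures
    for result in converter.convert_all(paths, raises_on_error=False):
        if result.status == ConversionStatus.SUCCESS:
            result.document.save_as_markdown(markdown_output_path(result.input.file, out_dir))
        else:
            print(f"Failed: {result.input.file}")
```

The per-result status check mirrors the Error Handling section below; see references/batch.md for concurrency details.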

URL Conversion

converter = DocumentConverter()
result = converter.convert("https://example.com/document.pdf")

Binary Stream Conversion

from io import BytesIO
from docling.datamodel.base_models import DocumentStream

with open("document.pdf", "rb") as f:
    buf = BytesIO(f.read())

source = DocumentStream(name="document.pdf", stream=buf)
result = converter.convert(source)

Format-Specific Options

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configure PDF-specific options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options.lang = ["en", "es"]
pipeline_options.do_table_structure = True
pipeline_options.generate_page_images = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

Resource Limits

converter = DocumentConverter()

# Limit file size (bytes) and page count
result = converter.convert(
    "large_document.pdf",
    max_file_size=20_971_520,  # 20 MB
    max_num_pages=100
)

Document Chunking

See references/chunking.md for RAG integration.
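As a hedged sketch of the shape a RAG preparation step can take (the function is illustrative; HybridChunker is Docling's own, and the max_tokens parameter is assumed from its documented interface):

```python
def chunk_document(source, max_tokens=None):
    """Convert a document and return its chunk texts for a RAG index (sketch)."""
    # Deferred imports keep this sketch importable without Docling installed.
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    chunker = HybridChunker() if max_tokens is None else HybridChunker(max_tokens=max_tokens)
    return [chunk.text for chunk in chunker.chunk(result.document)]
```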

DoclingDocument Structure

The DoclingDocument is a Pydantic model representing parsed content:

# Access document structure
doc = result.document

# Content items (lists)
doc.texts         # TextItem instances (paragraphs, headings, etc.)
doc.tables        # TableItem instances
doc.pictures      # PictureItem instances
doc.key_value_items  # Key-value pairs

# Structure (tree nodes)
doc.body          # Main content hierarchy
doc.furniture     # Headers, footers, page numbers
doc.groups        # Lists, chapters, sections

# Iterate all elements in reading order
for item, level in doc.iterate_items():
    print(f"{'  ' * level}{item.label}: {item.text[:50]}")
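For example, the doc.tables list can be walked to pull tables out individually. A hedged sketch (TableItem exposes export_to_dataframe(), which requires pandas; the function and output layout are illustrative):

```python
def dump_tables(source, out_dir="tables"):
    """Save each table in a document as CSV and return the table count (sketch)."""
    from pathlib import Path

    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(source)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for i, table in enumerate(result.document.tables):
        # export_to_dataframe() needs pandas installed; newer releases may expect
        # the parent document as an argument, so check your installed version.
        df = table.export_to_dataframe()
        df.to_csv(Path(out_dir) / f"table_{i}.csv", index=False)
    return len(result.document.tables)
```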

Advanced Features

OCR Configuration

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
    TesseractOcrOptions,
    TesseractCliOcrOptions,
    OcrMacOptions,
    RapidOcrOptions
)

# EasyOCR (default)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(lang=["en", "de"])

# Tesseract
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng", "deu"])

# RapidOCR
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions()

Table Extraction Options

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableFormerMode
)

pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True

# Use cell matching (map to PDF cells)
pipeline_options.table_structure_options.do_cell_matching = True

# Or use predicted cells
pipeline_options.table_structure_options.do_cell_matching = False

# Choose accuracy mode
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

Page Images

from docling_core.types.doc import ImageRefMode

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_page_images = True  # Needed for HTML export with embedded images

# Export with embedded images
result.document.save_as_html(
    "output.html",
    image_mode=ImageRefMode.EMBEDDED
)

Error Handling

from docling.datamodel.base_models import ConversionStatus

result = converter.convert("document.pdf")

if result.status == ConversionStatus.SUCCESS:
    print("Conversion successful")
elif result.status == ConversionStatus.PARTIAL_SUCCESS:
    print("Partial conversion:")
    for error in result.errors:
        print(f"  {error.error_message}")
else:  # FAILURE
    print("Conversion failed:")
    for error in result.errors:
        print(f"  {error.error_message}")

For batch processing with error handling:

# Continue processing on errors
results = converter.convert_all(
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    raises_on_error=False
)

for result in results:
    if result.status == ConversionStatus.SUCCESS:
        result.document.save_as_markdown(f"{result.input.file.stem}.md")
    else:
        print(f"Failed: {result.input.file}")

CLI Usage

# Basic conversion
docling document.pdf

# Convert to specific output
docling --to markdown document.pdf

# With custom model path
docling --artifacts-path /path/to/models document.pdf

# Using VLM pipeline
docling --pipeline vlm --vlm-model granite_docling document.pdf

Reference Documentation

  • Parsing Options - DocumentConverter initialization, format-specific options, OCR configuration
  • Batch Processing - convert_all(), error handling, concurrency patterns
  • Chunking - HierarchicalChunker, HybridChunker, RAG integration
  • Output Formats - export_to_markdown(), export_to_html(), export_to_dict(), document structure

Key Types

  • DocumentConverter: Main conversion class
  • ConversionResult: Result of conversion with document and status
  • DoclingDocument: Unified document representation (Pydantic model)
  • InputFormat: Enum of supported input formats
  • ConversionStatus: SUCCESS, PARTIAL_SUCCESS, FAILURE
  • PdfPipelineOptions: Configuration for PDF pipeline
  • ImageRefMode: EMBEDDED, REFERENCED, PLACEHOLDER

Integration Examples

LangChain

from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownTextSplitter

converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()

splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_text(markdown)

LlamaIndex

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from llama_index.core import Document

converter = DocumentConverter()
result = converter.convert("document.pdf")

chunker = HybridChunker()
chunks = list(chunker.chunk(result.document))

documents = [
    Document(text=chunk.text, metadata=chunk.meta.export_json_dict())
    for chunk in chunks
]

Notes

  • Docling uses a synchronous API (no native async support)
  • Models are downloaded automatically on first use (can be prefetched)
  • Supports local execution for air-gapped environments
  • Supports GPU acceleration for OCR and table detection
  • Default models run on CPU; GPU requires configuration
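For the air-gapped case, one documented pattern is to point the PDF pipeline at a pre-downloaded model directory via artifacts_path. A sketch (the model directory path is a placeholder you supply):

```python
def offline_converter(model_dir):
    """Build a converter that loads models from a local directory (sketch)."""
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions(artifacts_path=model_dir)
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
```

This is the Python-side counterpart of the CLI's --artifacts-path flag shown above.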

Source

View on GitHub: https://github.com/existential-birds/beagle/blob/main/plugins/beagle-core/skills/docling/SKILL.md

Overview

Docling parses PDFs, DOCX, PPTX, HTML, images, and 15+ formats into structured data with advanced layout understanding. It supports extracting text, converting to Markdown, HTML, or JSON, and chunking content for RAG pipelines or batch processing. The library is driven by a DocumentConverter and exposes standardized ConversionResult data.

How This Skill Works

Docling provides a DocumentConverter as the main entry point for processing diverse inputs with configurable pipelines. Conversion returns a ConversionResult containing the parsed document, status, and errors, with export methods such as export_to_markdown, export_to_html, and export_to_dict for downstream use. You can tailor behavior per input type with format_options and restrict inputs with allowed_formats.

When to Use It

  • Ingest PDFs, DOCX, PPTX, or image-based documents to extract text and structure.
  • Convert documents into Markdown, HTML, or JSON for CMS, analytics, or data pipelines.
  • Chunk long documents for RAG pipelines using HierarchicalChunker or HybridChunker.
  • Batch-process large collections of documents with convert_all for automation.
  • Handle multiple formats through a single API and produce consistent outputs for downstream apps.

Quick Start

  1. Install Docling, import DocumentConverter, and prepare your source (URL, path, or bytes).
  2. Create a DocumentConverter (optionally set allowed_formats) and call converter.convert(source).
  3. Access outputs via result.document.export_to_markdown(), export_to_html(), or export_to_dict(), or save directly to files.

Best Practices

  • Enable OCR when processing scanned or image-based documents to improve text extraction.
  • Use format_options to tune OCR, table structure, or other pipeline options per input type.
  • Choose an appropriate chunking strategy (HierarchicalChunker vs HybridChunker) based on document length and search needs.
  • Validate outputs with export_to_markdown, export_to_html, or export_to_dict and check ConversionResult.errors.
  • For large collections, leverage convert_all for batch processing and monitor per-document status.

Example Use Cases

  • Convert a research paper PDF to Markdown and JSON for ingestion into a knowledge base.
  • Extract content from a PPTX deck and export it as HTML for a web presentation.
  • Batch-process a folder of scanned invoices, using OCR to produce Docling JSON for accounting systems.
  • Transform HTML pages into Markdown with structured headings for a static site generator.
  • Use HierarchicalChunker to prepare a long technical report for a RAG-based search application.
