Docling Document Parser
Docling is a document parsing library that converts PDFs, Word documents, PowerPoint, images, and other formats into structured data with advanced layout understanding.
Quick Start
Basic document conversion:
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # URL, Path, or BytesIO
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
Core Concepts
DocumentConverter
The main entry point for document conversion. Supports various input formats and conversion options.
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
# Basic converter (all formats enabled)
converter = DocumentConverter()
# Restricted formats
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX]
)
# Custom pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
ConversionResult
All conversion operations return a ConversionResult containing:
- document: The parsed DoclingDocument
- status: ConversionStatus.SUCCESS, PARTIAL_SUCCESS, or FAILURE
- errors: List of errors encountered during conversion
- input: Information about the source document
from docling.datamodel.base_models import ConversionStatus
result = converter.convert("document.pdf")
if result.status == ConversionStatus.SUCCESS:
    markdown = result.document.export_to_markdown()
    html = result.document.export_to_html()
    data = result.document.export_to_dict()
Supported Formats
Input Formats
- Documents: PDF, DOCX, PPTX, XLSX
- Markup: HTML, Markdown, AsciiDoc
- Data: CSV, JSON (Docling format)
- Images: PNG, JPEG, TIFF, BMP, WEBP
- Audio: WAV, MP3
- Video Text: WebVTT
- Schema-specific: USPTO XML, JATS XML, METS-GBS
Output Formats
- Markdown: export_to_markdown() or save_as_markdown()
- HTML: export_to_html() or save_as_html()
- JSON: export_to_dict() or save_as_json() (note: there is no export_to_json() method)
- Text: export_to_text(), export_to_markdown(strict_text=True), or save_as_markdown(strict_text=True)
- DocTags: export_to_doctags() or save_as_doctags()
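These export methods follow a consistent naming pattern, so they are easy to dispatch on. The export_document helper below is hypothetical (not part of Docling); only the method names it calls come from the library:

```python
# Hypothetical helper: map a format name to the matching Docling export method.
# The method names are Docling's; the dispatcher itself is illustrative.
def export_document(doc, fmt: str):
    exporters = {
        "markdown": lambda: doc.export_to_markdown(),
        "text": lambda: doc.export_to_markdown(strict_text=True),
        "html": lambda: doc.export_to_html(),
        "json": lambda: doc.export_to_dict(),
        "doctags": lambda: doc.export_to_doctags(),
    }
    if fmt not in exporters:
        raise ValueError(f"unsupported format: {fmt}")
    return exporters[fmt]()
```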
Common Patterns
Single File Conversion
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Export to different formats
markdown = result.document.export_to_markdown()
html = result.document.export_to_html()
json_data = result.document.export_to_dict()
# Or save directly to file
result.document.save_as_markdown("output.md")
result.document.save_as_html("output.html")
result.document.save_as_json("output.json")
Batch Processing
See references/batch.md for details on convert_all().
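As a minimal sketch of the idea: convert_all() accepts an iterable of sources and yields one ConversionResult per input. The batch_to_markdown helper and the directory layout are assumptions for illustration, and the Docling import is deferred into the function body:

```python
from pathlib import Path

def batch_to_markdown(input_dir: str, output_dir: str) -> list[Path]:
    """Convert every PDF under input_dir and save one Markdown file each."""
    from docling.document_converter import DocumentConverter

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    converter = DocumentConverter()
    written = []
    # convert_all() yields one ConversionResult per input source
    for result in converter.convert_all(sorted(Path(input_dir).glob("*.pdf"))):
        target = out_dir / f"{result.input.file.stem}.md"
        result.document.save_as_markdown(target)
        written.append(target)
    return written
```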
URL Conversion
converter = DocumentConverter()
result = converter.convert("https://example.com/document.pdf")
Binary Stream Conversion
from io import BytesIO
from docling.datamodel.base_models import DocumentStream
with open("document.pdf", "rb") as f:
    buf = BytesIO(f.read())
source = DocumentStream(name="document.pdf", stream=buf)
result = converter.convert(source)
Format-Specific Options
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
# Configure PDF-specific options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options.lang = ["en", "es"]
pipeline_options.do_table_structure = True
pipeline_options.generate_page_images = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
Resource Limits
converter = DocumentConverter()
# Limit file size (bytes) and page count
result = converter.convert(
    "large_document.pdf",
    max_file_size=20_971_520,  # 20 MB
    max_num_pages=100
)
Document Chunking
See references/chunking.md for RAG integration.
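For a quick feel before reading the reference, here is a minimal sketch using HybridChunker from docling.chunking. The chunk_for_rag helper is hypothetical, and the Docling imports are deferred into the function body:

```python
def chunk_for_rag(source: str) -> list[str]:
    """Parse a document and return chunk texts ready for embedding."""
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(source).document
    chunker = HybridChunker()
    # chunk() yields chunk objects exposing .text (and .meta for provenance)
    return [chunk.text for chunk in chunker.chunk(doc)]
```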
DoclingDocument Structure
The DoclingDocument is a Pydantic model representing parsed content:
# Access document structure
doc = result.document
# Content items (lists)
doc.texts # TextItem instances (paragraphs, headings, etc.)
doc.tables # TableItem instances
doc.pictures # PictureItem instances
doc.key_value_items # Key-value pairs
# Structure (tree nodes)
doc.body # Main content hierarchy
doc.furniture # Headers, footers, page numbers
doc.groups # Lists, chapters, sections
# Iterate all elements in reading order
for item, level in doc.iterate_items():
    # Not every item type has .text (e.g. tables), so guard the access
    print(f"{'  ' * level}{item.label}: {getattr(item, 'text', '')[:50]}")
Advanced Features
OCR Configuration
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
    TesseractOcrOptions,
    TesseractCliOcrOptions,
    OcrMacOptions,
    RapidOcrOptions,
)
# EasyOCR (default)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(lang=["en", "de"])
# Tesseract
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng", "deu"])
# RapidOCR
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions()
Table Extraction Options
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableFormerMode,
)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True
# Use cell matching (map to PDF cells)
pipeline_options.table_structure_options.do_cell_matching = True
# Or use predicted cells
pipeline_options.table_structure_options.do_cell_matching = False
# Choose accuracy mode
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
Page Images
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_page_images = True # Needed for HTML export with images
from docling_core.types.doc import ImageRefMode
# Export with embedded images
result.document.save_as_html(
    "output.html",
    image_mode=ImageRefMode.EMBEDDED
)
Error Handling
from docling.datamodel.base_models import ConversionStatus
result = converter.convert("document.pdf")
if result.status == ConversionStatus.SUCCESS:
    print("Conversion successful")
elif result.status == ConversionStatus.PARTIAL_SUCCESS:
    print("Partial conversion:")
    for error in result.errors:
        print(f"  {error.error_message}")
else:  # FAILURE
    print("Conversion failed:")
    for error in result.errors:
        print(f"  {error.error_message}")
For batch processing with error handling:
# Continue processing on errors
results = converter.convert_all(
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    raises_on_error=False
)
for result in results:
    if result.status == ConversionStatus.SUCCESS:
        result.document.save_as_markdown(f"{result.input.file.stem}.md")
    else:
        print(f"Failed: {result.input.file}")
CLI Usage
# Basic conversion
docling document.pdf
# Convert to specific output
docling --to markdown document.pdf
# With custom model path
docling --artifacts-path /path/to/models document.pdf
# Using VLM pipeline
docling --pipeline vlm --vlm-model granite_docling document.pdf
Reference Documentation
- Parsing Options - DocumentConverter initialization, format-specific options, OCR configuration
- Batch Processing - convert_all(), error handling, concurrency patterns
- Chunking - HierarchicalChunker, HybridChunker, RAG integration
- Output Formats - export_to_markdown(), export_to_html(), export_to_dict(), document structure
Key Types
- DocumentConverter: Main conversion class
- ConversionResult: Result of conversion with document and status
- DoclingDocument: Unified document representation (Pydantic model)
- InputFormat: Enum of supported input formats
- ConversionStatus: SUCCESS, PARTIAL_SUCCESS, FAILURE
- PdfPipelineOptions: Configuration for PDF pipeline
- ImageRefMode: EMBEDDED, REFERENCED, PLACEHOLDER
Integration Examples
LangChain
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownTextSplitter
converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()
splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_text(markdown)
LlamaIndex
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from llama_index.core import Document
converter = DocumentConverter()
result = converter.convert("document.pdf")
chunker = HybridChunker()
chunks = list(chunker.chunk(result.document))
documents = [
    Document(text=chunk.text, metadata=chunk.meta.export_json_dict())
    for chunk in chunks
]
Notes
- Docling uses a synchronous API (no native async support)
- Models are downloaded automatically on first use (can be prefetched)
- Supports local execution for air-gapped environments
- Supports GPU acceleration for OCR and table detection
- Default models run on CPU; GPU requires configuration
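Prefetching and local execution can be combined into one setup step. The sketch below assumes docling's model-downloader utility (docling.utils.model_downloader.download_models) and the artifacts_path option on PdfPipelineOptions; the prepare_offline_converter helper itself is hypothetical:

```python
from pathlib import Path

def prepare_offline_converter(artifacts_dir: str):
    """Prefetch models once, then build a converter that reads them locally."""
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.utils.model_downloader import download_models

    path = Path(artifacts_dir)
    download_models(output_dir=path)  # one-time step, needs network access

    # Point the pipeline at the local model directory for air-gapped runs
    pipeline_options = PdfPipelineOptions(artifacts_path=str(path))
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```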
Source
https://github.com/existential-birds/beagle/blob/main/plugins/beagle-core/skills/docling/SKILL.md
Overview
Docling parses PDFs, DOCX, PPTX, HTML, images, and 15+ formats into structured data with advanced layout understanding. It supports extracting text, converting to Markdown, HTML, or JSON, and chunking content for RAG pipelines or batch processing. The library is driven by a DocumentConverter and exposes standardized ConversionResult data.
How This Skill Works
Docling provides a DocumentConverter as the main entry point to process diverse inputs with configurable pipelines. It outputs a ConversionResult containing the parsed document, status, and errors, along with export methods like export_to_markdown, export_to_html, and export_to_dict for downstream use. You can tailor behavior with format_options and allowed formats to fit your workflow.
When to Use It
- Ingest PDFs, DOCX, PPTX, or image-based documents to extract text and structure.
- Convert documents into Markdown, HTML, or JSON for CMS, analytics, or data pipelines.
- Chunk long documents for RAG pipelines using HierarchicalChunker or HybridChunker.
- Batch-process large collections of documents with convert_all for automation.
- Handle multiple formats through a single API and produce consistent outputs for downstream apps.
Quick Start
- Step 1: Install docling, import DocumentConverter, and prepare your source (URL, path, or BytesIO stream).
- Step 2: Create a DocumentConverter (optionally set allowed_formats) and call converter.convert(source).
- Step 3: Access outputs via result.document.export_to_markdown(), export_to_html(), or export_to_dict(), or save to files.
Best Practices
- Enable OCR when processing scanned or image-based documents to improve text extraction.
- Use format_options to tune OCR, table structure, or other pipeline options per input type.
- Choose an appropriate chunking strategy (HierarchicalChunker vs HybridChunker) based on document length and search needs.
- Validate outputs with export_to_markdown, export_to_html, or export_to_dict and check ConversionResult.errors.
- For large collections, leverage convert_all for batch processing and monitor per-document status.
Example Use Cases
- Convert a research paper PDF to Markdown and JSON for ingestion into a knowledge base.
- Extract content from a PPTX deck and export it as HTML for a web presentation.
- Batch-process a folder of scanned invoices, using OCR to produce Docling JSON for accounting systems.
- Transform HTML pages into Markdown with structured headings for a static site generator.
- Use HierarchicalChunker to prepare a long technical report for a RAG-based search application.