Docling Document Parser
Docling is a document parsing library that converts PDFs, Word documents, PowerPoint, images, and other formats into structured data with advanced layout understanding.
Quick Start
Basic document conversion:
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # URL, Path, or BytesIO
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
Core Concepts
DocumentConverter
The main entry point for document conversion. Supports various input formats and conversion options.
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
# Basic converter (all formats enabled)
converter = DocumentConverter()
# Restricted formats
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX]
)
# Custom pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
ConversionResult
All conversion operations return a ConversionResult containing:
- document: The parsed DoclingDocument
- status: ConversionStatus.SUCCESS, PARTIAL_SUCCESS, or FAILURE
- errors: List of errors encountered during conversion
- input: Information about the source document
from docling.datamodel.base_models import ConversionStatus
result = converter.convert("document.pdf")
if result.status == ConversionStatus.SUCCESS:
    markdown = result.document.export_to_markdown()
    html = result.document.export_to_html()
    data = result.document.export_to_dict()
Supported Formats
Input Formats
- Documents: PDF, DOCX, PPTX, XLSX
- Markup: HTML, Markdown, AsciiDoc
- Data: CSV, JSON (Docling format)
- Images: PNG, JPEG, TIFF, BMP, WEBP
- Audio: WAV, MP3
- Video Text: WebVTT
- Schema-specific: USPTO XML, JATS XML, METS-GBS
Output Formats
- Markdown: export_to_markdown() or save_as_markdown()
- HTML: export_to_html() or save_as_html()
- JSON: export_to_dict() or save_as_json() (note: there is no export_to_json() method)
- Text: export_to_text(), export_to_markdown(strict_text=True), or save_as_markdown(strict_text=True)
- DocTags: export_to_doctags() or save_as_doctags()
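These export methods follow a consistent naming pattern, so they are easy to dispatch on. The export_document helper below is hypothetical (not part of Docling); only the method names it calls come from the library:

```python
# Hypothetical helper: map a format name to the matching Docling export method.
# The method names are Docling's; the dispatcher itself is illustrative.
def export_document(doc, fmt: str):
    exporters = {
        "markdown": lambda: doc.export_to_markdown(),
        "text": lambda: doc.export_to_markdown(strict_text=True),
        "html": lambda: doc.export_to_html(),
        "json": lambda: doc.export_to_dict(),
        "doctags": lambda: doc.export_to_doctags(),
    }
    if fmt not in exporters:
        raise ValueError(f"unsupported format: {fmt}")
    return exporters[fmt]()
```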
Common Patterns
Single File Conversion
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Export to different formats
markdown = result.document.export_to_markdown()
html = result.document.export_to_html()
json_data = result.document.export_to_dict()
# Or save directly to file
result.document.save_as_markdown("output.md")
result.document.save_as_html("output.html")
result.document.save_as_json("output.json")
Batch Processing
See references/batch.md for details on convert_all().
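As a minimal sketch of the idea: convert_all() accepts an iterable of sources and yields one ConversionResult per input. The batch_to_markdown helper and the directory layout are assumptions for illustration, and the Docling import is deferred into the function body:

```python
from pathlib import Path

def batch_to_markdown(input_dir: str, output_dir: str) -> list[Path]:
    """Convert every PDF under input_dir and save one Markdown file each."""
    from docling.document_converter import DocumentConverter

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    converter = DocumentConverter()
    written = []
    # convert_all() yields one ConversionResult per input source
    for result in converter.convert_all(sorted(Path(input_dir).glob("*.pdf"))):
        target = out_dir / f"{result.input.file.stem}.md"
        result.document.save_as_markdown(target)
        written.append(target)
    return written
```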
URL Conversion
converter = DocumentConverter()
result = converter.convert("https://example.com/document.pdf")
Binary Stream Conversion
from io import BytesIO
from docling.datamodel.base_models import DocumentStream
with open("document.pdf", "rb") as f:
    buf = BytesIO(f.read())
source = DocumentStream(name="document.pdf", stream=buf)
result = converter.convert(source)
Format-Specific Options
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
# Configure PDF-specific options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options.lang = ["en", "es"]
pipeline_options.do_table_structure = True
pipeline_options.generate_page_images = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
Resource Limits
converter = DocumentConverter()
# Limit file size (bytes) and page count
result = converter.convert(
    "large_document.pdf",
    max_file_size=20_971_520,  # 20 MB
    max_num_pages=100
)
Document Chunking
See references/chunking.md for RAG integration.
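For a quick feel before reading the reference, here is a minimal sketch using HybridChunker from docling.chunking. The chunk_for_rag helper is hypothetical, and the Docling imports are deferred into the function body:

```python
def chunk_for_rag(source: str) -> list[str]:
    """Parse a document and return chunk texts ready for embedding."""
    from docling.chunking import HybridChunker
    from docling.document_converter import DocumentConverter

    doc = DocumentConverter().convert(source).document
    chunker = HybridChunker()
    # chunk() yields chunk objects exposing .text (and .meta for provenance)
    return [chunk.text for chunk in chunker.chunk(doc)]
```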
DoclingDocument Structure
The DoclingDocument is a Pydantic model representing parsed content:
# Access document structure
doc = result.document
# Content items (lists)
doc.texts # TextItem instances (paragraphs, headings, etc.)
doc.tables # TableItem instances
doc.pictures # PictureItem instances
doc.key_value_items # Key-value pairs
# Structure (tree nodes)
doc.body # Main content hierarchy
doc.furniture # Headers, footers, page numbers
doc.groups # Lists, chapters, sections
# Iterate all elements in reading order
for item, level in doc.iterate_items():
    # Not every item type has .text (e.g. tables), so guard the access
    print(f"{'  ' * level}{item.label}: {getattr(item, 'text', '')[:50]}")
Advanced Features
OCR Configuration
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
    TesseractOcrOptions,
    TesseractCliOcrOptions,
    OcrMacOptions,
    RapidOcrOptions,
)
# EasyOCR (default)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(lang=["en", "de"])
# Tesseract
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng", "deu"])
# RapidOCR
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions()
Table Extraction Options
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableFormerMode,
)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True
# Use cell matching (map to PDF cells)
pipeline_options.table_structure_options.do_cell_matching = True
# Or use predicted cells
pipeline_options.table_structure_options.do_cell_matching = False
# Choose accuracy mode
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
Page Images
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_page_images = True # Needed for HTML export with images
from docling_core.types.doc import ImageRefMode
# Export with embedded images
result.document.save_as_html(
    "output.html",
    image_mode=ImageRefMode.EMBEDDED
)
Error Handling
from docling.datamodel.base_models import ConversionStatus
result = converter.convert("document.pdf")
if result.status == ConversionStatus.SUCCESS:
    print("Conversion successful")
elif result.status == ConversionStatus.PARTIAL_SUCCESS:
    print("Partial conversion:")
    for error in result.errors:
        print(f"  {error.error_message}")
else:  # FAILURE
    print("Conversion failed:")
    for error in result.errors:
        print(f"  {error.error_message}")
For batch processing with error handling:
# Continue processing on errors
results = converter.convert_all(
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    raises_on_error=False
)
for result in results:
    if result.status == ConversionStatus.SUCCESS:
        result.document.save_as_markdown(f"{result.input.file.stem}.md")
    else:
        print(f"Failed: {result.input.file}")
CLI Usage
# Basic conversion
docling document.pdf
# Convert to specific output
docling --to markdown document.pdf
# With custom model path
docling --artifacts-path /path/to/models document.pdf
# Using VLM pipeline
docling --pipeline vlm --vlm-model granite_docling document.pdf
Reference Documentation
- Parsing Options - DocumentConverter initialization, format-specific options, OCR configuration
- Batch Processing - convert_all(), error handling, concurrency patterns
- Chunking - HierarchicalChunker, HybridChunker, RAG integration
- Output Formats - export_to_markdown(), export_to_html(), export_to_dict(), document structure
Key Types
- DocumentConverter: Main conversion class
- ConversionResult: Result of conversion with document and status
- DoclingDocument: Unified document representation (Pydantic model)
- InputFormat: Enum of supported input formats
- ConversionStatus: SUCCESS, PARTIAL_SUCCESS, FAILURE
- PdfPipelineOptions: Configuration for PDF pipeline
- ImageRefMode: EMBEDDED, REFERENCED, PLACEHOLDER
Integration Examples
LangChain
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownTextSplitter
converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()
splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_text(markdown)
LlamaIndex
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from llama_index.core import Document
converter = DocumentConverter()
result = converter.convert("document.pdf")
chunker = HybridChunker()
chunks = list(chunker.chunk(result.document))
documents = [
    Document(text=chunk.text, metadata=chunk.meta.export_json_dict())
    for chunk in chunks
]
Notes
- Docling uses a synchronous API (no native async support)
- Models are downloaded automatically on first use (can be prefetched)
- Supports local execution for air-gapped environments
- Supports GPU acceleration for OCR and table detection
- Default models run on CPU; GPU requires configuration
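Prefetching and local execution can be combined into one setup step. The sketch below assumes docling's model-downloader utility (docling.utils.model_downloader.download_models) and the artifacts_path option on PdfPipelineOptions; the prepare_offline_converter helper itself is hypothetical:

```python
from pathlib import Path

def prepare_offline_converter(artifacts_dir: str):
    """Prefetch models once, then build a converter that reads them locally."""
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.utils.model_downloader import download_models

    path = Path(artifacts_dir)
    download_models(output_dir=path)  # one-time step, needs network access

    # Point the pipeline at the local model directory for air-gapped runs
    pipeline_options = PdfPipelineOptions(artifacts_path=str(path))
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```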
Source
https://github.com/existential-birds/beagle/blob/main/plugins/beagle-core/skills/docling/SKILL.md
Overview
Docling parses PDFs, DOCX, PPTX, HTML, images, and 15+ formats into structured data with advanced layout understanding. It supports extracting text, converting to Markdown, HTML, or JSON, and chunking content for RAG pipelines or batch processing. The library is driven by a DocumentConverter and exposes standardized ConversionResult data.
How This Skill Works
Docling provides a DocumentConverter as the main entry point to process diverse inputs with configurable pipelines. It outputs a ConversionResult containing the parsed document, status, and errors, along with export methods like export_to_markdown, export_to_html, and export_to_dict for downstream use. You can tailor behavior with format_options and allowed formats to fit your workflow.
When to Use It
- Ingest PDFs, DOCX, PPTX, or image-based documents to extract text and structure.
- Convert documents into Markdown, HTML, or JSON for CMS, analytics, or data pipelines.
- Chunk long documents for RAG pipelines using HierarchicalChunker or HybridChunker.
- Batch-process large collections of documents with convert_all for automation.
- Handle multiple formats through a single API and produce consistent outputs for downstream apps.
Quick Start
- Step 1: Install docling, import DocumentConverter, and prepare your source (URL, path, or BytesIO stream).
- Step 2: Create a DocumentConverter (optionally set allowed_formats) and call converter.convert(source).
- Step 3: Access outputs via result.document.export_to_markdown(), export_to_html(), or export_to_dict(), or save to files.
Best Practices
- Enable OCR when processing scanned or image-based documents to improve text extraction.
- Use format_options to tune OCR, table structure, or other pipeline options per input type.
- Choose an appropriate chunking strategy (HierarchicalChunker vs HybridChunker) based on document length and search needs.
- Validate outputs with export_to_markdown, export_to_html, or export_to_dict and check ConversionResult.errors.
- For large collections, leverage convert_all for batch processing and monitor per-document status.
Example Use Cases
- Convert a research paper PDF to Markdown and JSON for ingestion into a knowledge base.
- Extract content from a PPTX deck and export it as HTML for a web presentation.
- Batch-process a folder of scanned invoices, using OCR to produce Docling JSON for accounting systems.
- Transform HTML pages into Markdown with structured headings for a static site generator.
- Use HierarchicalChunker to prepare a long technical report for a RAG-based search application.