A Python library for parsing diverse document formats into structured data, optimized for integration with generative AI applications.
Docling is a Python library that converts documents in various formats (PDFs, Office files, images, and audio) into structured, machine-readable data. It prepares unstructured documents for use with generative AI models by extracting text, layout, tables, and other elements into a unified format.
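As a rough illustration of that conversion step, the sketch below wraps Docling's `DocumentConverter` in a small helper that returns Markdown. The helper name `convert_to_markdown` and the file name in the usage comment are assumptions made for this example, and it presumes the `docling` package is installed; the first conversion may also download model files.

```python
def convert_to_markdown(source: str) -> str:
    """Convert one document (local path or URL) to Markdown using Docling.

    `convert_to_markdown` is a hypothetical helper for this sketch,
    not part of Docling itself.
    """
    # Imported lazily so this module can be loaded even where docling
    # is not installed; the import only runs when the helper is called.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)  # conversion result for this one input
    # result.document is the unified DoclingDocument; export it as Markdown.
    return result.document.export_to_markdown()


# Usage (the file name is a placeholder):
# print(convert_to_markdown("report.pdf"))
```

The same `result.document` object can be exported in other forms, so Markdown here is just one convenient target for feeding text into a downstream pipeline.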
It is aimed at developers and data scientists building AI applications that need to process and understand documents, especially those working with RAG pipelines, agentic AI, or knowledge management systems.
Developers choose Docling for its extensive format support, advanced PDF understanding, and seamless integrations with popular AI frameworks, all while offering local execution for data privacy and security.
Get your documents ready for gen AI
Parses PDFs, DOCX, images, audio, and over a dozen other formats into a single DoclingDocument, eliminating the need for multiple specialized libraries.
Extracts page layout, reading order, table structures, and formulas using models like Heron, providing semantic understanding beyond basic text extraction.
Offers native connectors for LangChain, LlamaIndex, and Crew AI, allowing parsed documents to feed directly into RAG pipelines and agentic workflows.
Runs entirely on-device without cloud calls, making it suitable for sensitive data or air-gapped environments, a point the project's own philosophy emphasizes.
Structured information extraction is labeled as beta, meaning it may be prone to breaking changes or incomplete coverage for production use.
Relies on external models for OCR, VLM, and ASR tasks, which adds installation overhead and potential licensing issues, as seen with the GraniteDocling integration.
Dropped support for Python 3.9 in version 2.70.0, forcing teams to upgrade their environments, which can disrupt legacy projects.