A polyglot document intelligence framework with a Rust core for extracting text, metadata, and structured data from 91+ file formats.
Kreuzberg is a polyglot document intelligence framework with a high-performance Rust core. It extracts text, metadata, images, and structured information from over 91 file formats, including PDFs, Office documents, images, and code files, solving the problem of fragmented document processing tools.
Kreuzberg targets developers and engineers building data pipelines, search systems, RAG applications, or any other system that needs robust extraction from diverse document and code formats across multiple programming environments.
Developers choose Kreuzberg for the performance of its Rust core, its broad format and language support, and its native bindings for many programming languages, all within a single, unified framework.
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.
Handles 91+ formats across PDFs, Office docs, images, code, and academic files, eliminating the need for multiple extraction tools. The README lists eight categories with specific capabilities like Korean HWP files and JATS journal articles.
Provides native libraries for 11+ languages (Rust, Python, JS, etc.) and deployment options (CLI, REST API, MCP). The badges and installation section show precompiled binaries for cross-platform use, ensuring broad accessibility.
Built on Rust with SIMD optimizations, native PDFium, and streaming parsers for multi-GB files. The README emphasizes this enables processing at 'native speeds' without a GPU.
Parses 248 programming languages via tree-sitter to extract functions, imports, and docstrings. This is specifically highlighted for semantic chunking in RAG applications and AI coding assistants.
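To make the idea of extracting functions, imports, and docstrings concrete, here is a minimal sketch. It is not Kreuzberg's API and does not use tree-sitter: it swaps in Python's standard-library `ast` module, handles Python source only, and the `outline` function and `SAMPLE` string are hypothetical names chosen for illustration.

```python
import ast


def outline(source: str) -> dict:
    """Collect imports, function names, line spans, and docstrings.

    Illustrative stand-in using Python's ast module (Python source only);
    this is an assumption for demonstration, not how Kreuzberg works.
    """
    tree = ast.parse(source)
    imports, functions = [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Keep leading dots so relative imports stay recognizable.
            imports.append("." * node.level + (node.module or ""))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(
                {
                    "name": node.name,
                    "lineno": node.lineno,
                    "end_lineno": node.end_lineno,
                    "docstring": ast.get_docstring(node),
                }
            )
    # ast.walk is breadth-first, so sort to get source order.
    functions.sort(key=lambda f: f["lineno"])
    return {"imports": imports, "functions": functions}


SAMPLE = '''import os
from pathlib import Path

def load(path):
    """Read a file and return its text."""
    return Path(path).read_text()
'''

result = outline(SAMPLE)
print(result["imports"])
print(result["functions"][0]["name"])
```

Each function's line span and docstring give a natural chunk boundary and label, which is the kind of structure semantic chunking for RAG relies on.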
Supports multiple backends (Tesseract, PaddleOCR, EasyOCR) and VLM OCR via 146 LLM providers. The README details this extensible plugin system for customizing OCR workflows.
Requires external systems like ONNX Runtime 1.24+ for embeddings and separate OCR engine installations (e.g., Tesseract). The README notes these as mandatory steps, adding setup complexity.
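Because these external pieces must be installed separately, a quick preflight check can catch a missing one before a pipeline starts. The sketch below is an assumption for illustration, not part of Kreuzberg: `missing_tools` is a hypothetical helper, it only detects executables on `PATH` (such as the `tesseract` command-line binary), and it cannot detect shared libraries like ONNX Runtime.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of executables that cannot be found on PATH.

    Hypothetical helper: `which` is injectable so the check can be
    exercised without the tools actually being installed.
    """
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    # "tesseract" is the Tesseract OCR command-line executable.
    absent = missing_tools(["tesseract"])
    if absent:
        print("Missing external tools:", ", ".join(absent))
    else:
        print("All required executables found.")
```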
Docker images are roughly 1.0 to 1.3 GB, and precompiled bindings add further deployment overhead. The Docker section acknowledges this, which matters in resource-constrained environments.
Some bindings have gaps, like Ruby lacking Windows support per the platform table. This fragments the polyglot promise for specific OS-language combinations.
With multiple OCR backends, code intelligence grammars, and deployment modes, tuning for optimal performance requires deep expertise. The extensive documentation implies a steep learning curve beyond basic extraction.