A high-performance PDF toolkit for text/image extraction, markdown conversion, and PDF editing, built in Rust with Python, WASM, CLI, and MCP server bindings.
PDF Oxide is a high-performance, open-source PDF processing library built on a Rust core. It provides fast, reliable tools for extracting text and images from PDFs, converting them to markdown or HTML, and creating or editing PDF documents. It targets slow or restrictively licensed PDF processing in data pipelines, AI applications, and document automation.
It is aimed at developers and data engineers working on document processing pipelines, AI/ML practitioners building RAG systems, and anyone who needs fast, reliable PDF text and image extraction or conversion in Python, Rust, or JavaScript environments.
Developers choose PDF Oxide for its speed (5-15x faster than popular alternatives), its 100% pass rate on a 3,830-PDF test corpus, its permissive MIT/Apache-2.0 license (no AGPL restrictions), and its unified multi-platform API covering Rust, Python, WASM, CLI, and AI assistants via MCP.
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
Benchmarked at 0.8ms mean per document on 3,830 PDFs, making it 5x faster than PyMuPDF and 15x faster than pypdf for text extraction.
Achieves a 100% pass rate on the 3,830-PDF test corpus with zero panics or timeouts, which suggests consistent extraction across diverse real-world PDFs.
Provides native APIs for Rust and Python, WASM for JavaScript, a CLI tool, and an MCP server for AI assistants, covering a wide range of development environments.
Dual-licensed under MIT/Apache-2.0, allowing unrestricted commercial use without the copyleft restrictions of AGPL alternatives like PyMuPDF.
Lacks optical character recognition support, so it cannot extract text from scanned or image-based PDFs, limiting its applicability for legacy documents.
Requires a Rust toolchain for building from source, which can complicate deployment in environments where lightweight, interpreted language solutions are preferred.
As a version 0.3 project, it has a smaller ecosystem and fewer third-party integrations than established libraries, and it may still introduce breaking changes.
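The speed claims above can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted in this listing (0.8 ms mean, 3,830 PDFs, 5x and 15x speedups) to work out what they imply for a whole-corpus run. The derived numbers are implied values, not measurements, and the function names are hypothetical, chosen for illustration.

```python
# Figures quoted in the listing; everything derived from them is implied, not measured.
MEAN_MS_PER_DOC = 0.8                    # claimed mean extraction time per document
CORPUS_SIZE = 3830                       # number of PDFs in the benchmark corpus
SPEEDUPS = {"PyMuPDF": 5, "pypdf": 15}   # claimed speedup factors over each library


def implied_mean_ms(speedup: float, base_ms: float = MEAN_MS_PER_DOC) -> float:
    """Mean per-document time implied for a library that is `speedup` times slower."""
    return base_ms * speedup


def corpus_seconds(mean_ms: float, n_docs: int = CORPUS_SIZE) -> float:
    """Total time in seconds to process `n_docs` documents at `mean_ms` each."""
    return mean_ms * n_docs / 1000.0


if __name__ == "__main__":
    print(f"pdf_oxide: {corpus_seconds(MEAN_MS_PER_DOC):.2f} s for the corpus")
    for name, factor in SPEEDUPS.items():
        mean_ms = implied_mean_ms(factor)
        print(f"{name}: {mean_ms:.1f} ms/doc implied, {corpus_seconds(mean_ms):.2f} s for the corpus")
```

Under these assumptions, the whole corpus takes about 3 seconds at 0.8 ms per document, against roughly 15 seconds and 46 seconds at the implied 4 ms and 12 ms means.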
pdf_oxide is an open-source alternative to the following products:
PDFMiner is a Python library for extracting text and metadata from PDF documents, focusing on detailed layout analysis.
pdfplumber is a Python library for extracting and analyzing text, tables, and visual data from PDF files with high precision.
PyMuPDF is a Python binding for the MuPDF library, providing tools to manipulate PDF files including rendering, text extraction, and annotation.
pypdf is a pure-Python library for reading, splitting, merging, cropping, and transforming PDF files without external dependencies.