Open-Awesome

pdf_oxide

Apache-2.0 · Rust · v0.3.41

A high-performance PDF toolkit for text/image extraction, markdown conversion, and PDF editing, built in Rust with Python, WASM, CLI, and MCP server bindings.

717 stars · 78 forks · 0 contributors

What is pdf_oxide?

PDF Oxide is a high-performance, open-source PDF processing library built on a Rust core. It provides fast and reliable tools for extracting text and images from PDFs, converting them to markdown or HTML, and creating or editing PDF documents. It targets data pipelines, AI applications, and document automation where existing PDF processing is slow or restrictively licensed.

Target Audience

Developers and data engineers working with document processing pipelines, AI/ML practitioners building RAG systems, and anyone needing fast, reliable PDF text/image extraction or conversion in Python, Rust, or JavaScript environments.

Value Proposition

Developers choose PDF Oxide for its speed (reported as 5x faster than PyMuPDF and 15x faster than pypdf for text extraction), its 100% pass rate on a 3,830-PDF test corpus, its permissive MIT/Apache-2.0 license (no AGPL restrictions), and its unified multi-platform API covering Rust, Python, WASM, CLI, and AI assistants via MCP.

Overview

Billed as the fastest PDF library for Python and Rust, covering text extraction, image extraction, markdown conversion, and PDF creation and editing. Reported figures: 0.8 ms mean per document, 5× faster than PyMuPDF, and a 100% pass rate on 3,830 PDFs. Licensed MIT/Apache-2.0.

Use Cases

Best For

  • Building high-throughput document processing pipelines for RAG and LLM applications
  • Extracting structured text and table data from thousands of PDFs at scale
  • Converting academic papers or reports into clean markdown for analysis
  • Creating a local, privacy-focused PDF tool for AI assistants via the MCP server
  • Replacing AGPL-licensed PDF libraries like PyMuPDF in commercial projects
  • Programmatically generating, merging, or filling PDF forms and documents

Not Ideal For

  • Projects requiring optical character recognition (OCR) for scanned PDFs, as it only extracts text from native layers.
  • Environments with strict no-native-extension policies that mandate pure Python or JavaScript solutions.
  • Applications needing advanced PDF features like 3D content, multimedia annotations, or complex digital signature validation.
  • Legacy systems deeply integrated with older PDF libraries where migration effort outweighs performance gains.

Pros & Cons

Pros

Blazing Fast Performance

Benchmarked at 0.8ms mean per document on 3,830 PDFs, making it 5x faster than PyMuPDF and 15x faster than pypdf for text extraction.
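To put these figures in perspective, the short sketch below derives the per-document times they imply. Only the 0.8 ms mean, the 5x and 15x ratios, and the 3,830-PDF corpus size come from the listing; the derived values are back-of-the-envelope estimates rather than measured results, and the function names are illustrative.

```python
# Figures quoted in the listing (not independently measured).
MEAN_MS_PDF_OXIDE = 0.8                    # mean text-extraction time per document
CORPUS_SIZE = 3830                         # PDFs in the benchmark corpus
SPEEDUP_VS = {"PyMuPDF": 5, "pypdf": 15}   # stated "Nx faster" ratios


def implied_mean_ms(speedup: float, base_ms: float = MEAN_MS_PDF_OXIDE) -> float:
    """Per-document mean implied for a library that is `speedup` times slower."""
    return base_ms * speedup


def corpus_seconds(mean_ms: float, n_docs: int = CORPUS_SIZE) -> float:
    """Total sequential time for the whole corpus, in seconds."""
    return mean_ms * n_docs / 1000.0


if __name__ == "__main__":
    total = corpus_seconds(MEAN_MS_PDF_OXIDE)
    print(f"pdf_oxide: {MEAN_MS_PDF_OXIDE} ms/doc, about {total:.1f} s for the corpus")
    for name, ratio in SPEEDUP_VS.items():
        ms = implied_mean_ms(ratio)
        print(f"{name}: about {ms:.0f} ms/doc, about {corpus_seconds(ms):.1f} s for the corpus")
```

A mean hides the tail, so a few very large or malformed PDFs can dominate real pipeline time even when the average is this low.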

Unmatched Reliability

Achieves a 100% pass rate on the test corpus with zero panics or timeouts, ensuring consistent extraction across diverse real-world PDFs.
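A pass rate like this can be checked against your own documents with a small harness. The sketch below is a minimal, library-agnostic version: `extract` is a hypothetical callable you supply (for example, a thin wrapper around whichever PDF library you are evaluating), and the slow-document threshold is an arbitrary assumption. It is not the project's own benchmark code.

```python
import time
from typing import Callable, Iterable


def run_corpus(
    extract: Callable[[str], str],
    paths: Iterable[str],
    slow_threshold_s: float = 5.0,
) -> dict:
    """Call `extract` on every path and summarise pass rate and mean latency."""
    passed = 0
    total_ms = 0.0
    failures = []  # (path, reason) pairs
    for path in paths:
        start = time.perf_counter()
        try:
            extract(path)
        except Exception as exc:
            failures.append((path, repr(exc)))
            continue
        elapsed = time.perf_counter() - start
        # Post-hoc check only: this flags slow documents after they finish,
        # it does not interrupt a call that hangs.
        if elapsed > slow_threshold_s:
            failures.append((path, f"slow: {elapsed:.2f}s"))
            continue
        passed += 1
        total_ms += elapsed * 1000.0
    attempted = passed + len(failures)
    return {
        "pass_rate": passed / attempted if attempted else 0.0,
        "mean_ms": total_ms / passed if passed else 0.0,
        "failures": failures,
    }
```

Enforcing a hard timeout would require running each extraction in a separate process, which is left out here to keep the sketch short.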

Multi-Platform Versatility

Provides native APIs for Rust and Python, WASM for JavaScript, a CLI tool, and an MCP server for AI assistants, covering a wide range of development environments.

Permissive Licensing

Dual-licensed under MIT/Apache-2.0, allowing unrestricted commercial use without the copyleft restrictions of AGPL alternatives like PyMuPDF.

Cons

No OCR Capability

Lacks optical character recognition support, so it cannot extract text from scanned or image-based PDFs, limiting its applicability for legacy documents.
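One way to work around this in a pipeline is to detect documents that probably need OCR and route them elsewhere. The sketch below is a simple heuristic under stated assumptions: it takes the per-page text already returned by any extractor and flags documents where most pages contain almost no characters. The function names and both thresholds are illustrative choices, not part of pdf_oxide.

```python
from typing import Sequence


def textless_fraction(page_texts: Sequence[str], min_chars: int = 20) -> float:
    """Fraction of pages with fewer than `min_chars` non-whitespace characters."""
    if not page_texts:
        return 0.0
    textless = sum(
        1 for text in page_texts if len("".join(text.split())) < min_chars
    )
    return textless / len(page_texts)


def needs_ocr(
    page_texts: Sequence[str],
    min_chars: int = 20,
    max_textless: float = 0.5,
) -> bool:
    """Flag a document as likely scanned when most pages have no usable text."""
    return textless_fraction(page_texts, min_chars) > max_textless
```

Documents flagged this way can be sent to a separate OCR step, while native-text PDFs stay on the fast extraction path.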

Rust Dependency Overhead

Requires a Rust toolchain for building from source, which can complicate deployment in environments where lightweight, interpreted language solutions are preferred.

Limited Maturity

As a version 0.3 project, it has a smaller ecosystem and may introduce breaking changes, with fewer third-party integrations compared to established libraries.

Open Source Alternative To

pdf_oxide is an open-source alternative to the following products:

pdfminer

PDFMiner is a Python library for extracting text and metadata from PDF documents, focusing on detailed layout analysis.

pdfplumber

pdfplumber is a Python library for extracting and analyzing text, tables, and visual data from PDF files with high precision.

PyMuPDF

PyMuPDF is a Python binding for the MuPDF library, providing tools to manipulate PDF files including rendering, text extraction, and annotation.

pypdf

pypdf is a pure-Python library for reading, splitting, merging, cropping, and transforming PDF files without external dependencies.

Quick Stats

Stars: 717
Forks: 78
Contributors: 0
Open Issues: 41
Last commit: 1 day ago
Created: 2025

Tags

#text-extraction #open-source #high-performance #command-line-tool #pdf-processing #python #document-conversion #python-bindings #wasm #pdf-library #document-processing #rust-library #pdf-generation #mcp-server #rust #data-extraction #pdf

Built With

WASM
Rust
Python

Included in

Python (290.8k) · Rust (56.6k)
Auto-fetched 1 day ago

Related Projects

markitdown

Python tool for converting files and office documents to Markdown.

Stars: 119,620
Forks: 7,941
Last commit: 14 days ago

docling

Get your documents ready for gen AI

Stars: 59,017
Forks: 4,058
Last commit: 4 days ago

wgpu

A cross-platform, safe, pure-Rust graphics API.

Stars: 17,033
Forks: 1,283
Last commit: 2 days ago

rnote

Sketch and take handwritten notes.

Stars: 11,181
Forks: 459
Last commit: 3 days ago