A curated list of awesome open-source OCR software, libraries, datasets, and literature.
Awesome OCR is a curated GitHub repository listing open-source software, libraries, datasets, and literature related to Optical Character Recognition. It provides a centralized resource for developers and researchers working on text extraction from images and scanned documents, covering engines, tools, training data, and academic references.
It is intended for developers, data scientists, researchers, and digital humanists who need to implement, evaluate, or improve OCR systems using open-source tools and datasets.
It saves time by aggregating and categorizing OCR resources, from production-ready engines like Tesseract to niche datasets for historical scripts, in a single community-vetted list, so users do not have to search the web for reliable tools and research materials.
Links to awesome OCR projects
Aggregates hundreds of OCR engines, libraries, datasets, and academic papers in one place, with entries ranging from Tesseract to niche tools for historical scripts.
Includes resources for programming languages like Python, JavaScript, and Rust, plus datasets for non-Latin scripts such as Arabic and Fraktur, lowering barriers for diverse projects.
Prioritizes open-source tools and reproducible research, with clear licensing notes (e.g., Apache 2.0, GPL) to encourage community-driven development and transparency.
Lists not just engines but also preprocessing utilities, format converters (e.g., hOCR to ALTO), and evaluation tools, addressing real-world OCR pipeline needs.
The list only aggregates links; it does not rate or compare tools or advise which performs best for a given use case, so users are left to trial and error.
Explicitly marks sections such as 'Older and possibly abandoned OCR engines', so users must manually verify how current and well maintained each resource is.
The sheer volume and technical depth, from academic articles to CLI tools, can intimidate newcomers without prior OCR knowledge, even though the coverage is comprehensive.