Text Extraction

32 projects

A Python utility for converting PDFs, Office documents, images, audio, and more into structured Markdown for LLM consumption.

#text-extraction #pdf-conversion #microsoft-office

A collection of utilities that help customize Windows and streamline everyday tasks.

#text-extraction #open-source-microsoft #productivity-tools

A command-line tool that adds an OCR text layer to scanned PDF files, making them searchable and copy-pasteable.

#text-extraction #pdf-ocr #pdf-a

Stars: 34.2k

Forks: 2.4k

Last commit: 2 days ago

pypdf · Python

A pure-Python PDF library for splitting, merging, cropping, transforming, and extracting data from PDF files.

#text-extraction #pdf-merging #pdf-documents

Stars: 10.1k

Forks: 1.6k

Last commit: 23 days ago

ripgrep-all · Rust

A line-oriented search tool that extends ripgrep to search inside PDFs, Office documents, archives, and many other file types.

#search #text-extraction #ripgrep

Stars: 9.8k

Forks: 216

Last commit: 4 months ago

Kreuzberg · Rust

A polyglot document intelligence framework with a Rust core for extracting text, metadata, and structured data from 91+ file formats.

#text-extraction #document-intelligence #batch-processing

A Python library for extracting and analyzing text, images, and metadata from PDF documents.

#text-extraction #pdf-tools #open-source

Stars: 7.0k

Forks: 1.0k

Last commit: 4 months ago

trafilatura · Python

A Python library and CLI tool for web crawling, scraping, and extracting main text, metadata, and comments from web pages.

#text-extraction #readability #article-extractor

Stars: 6.3k

Forks: 397

Last commit: 5 days ago

sumy · Python

A Python library and CLI tool for automatic text summarization using extractive methods like LexRank, LSA, Luhn, and Edmundson.

#text-extraction #extractive-summarization #summarizer

Stars: 3.7k

Forks: 542

Last commit: 3 days ago

ocrad.js · JavaScript

A pure JavaScript OCR engine compiled from Ocrad via Emscripten for client-side text recognition in the browser.

#text-extraction #browser-ocr #webassembly

Stars: 3.5k

Forks: 380

Last commit: 5 years ago

gosseract · Go

A Go package for Optical Character Recognition (OCR) using the Tesseract C++ library.

#text-extraction #tesseract-ocr #go-library

Stars: 3.1k

Forks: 307

Last commit: 6 months ago

Awesome OCR

A curated list of awesome open-source OCR software, libraries, datasets, and literature.

#text-extraction #open-source #historical-documents

Stars: 3.1k

Forks: 375

Last commit: 2 years ago

TRex · Swift

A macOS menu bar app that uses OCR to copy any text visible on your screen directly to your clipboard.

#text-extraction #textrecognition #qr-code-reader

Stars: 1.9k

Forks: 62

Last commit: 11 days ago

Tess4J · Java

A Java JNA wrapper for the Tesseract OCR API, enabling OCR functionality in Java applications.

#text-extraction #pdf-ocr #java

Stars: 1.8k

Forks: 381

Last commit: 1 month ago

TextSnatcher · Vala

A lightweight Linux desktop application that extracts text from images using OCR with drag-and-drop simplicity.

#text-extraction #libhandy #tesseract-ocr

Stars: 1.4k

Forks: 54

Last commit: 2 years ago

treat · Ruby

A comprehensive natural language processing framework for Ruby with support for text extraction, parsing, and machine learning.

#text-extraction #computational-linguistics #text-analysis

Stars: 1.4k

Forks: 124

Last commit: 1 year ago

pdf_oxide · Rust

A high-performance PDF toolkit for text and image extraction, Markdown conversion, and PDF editing, built in Rust with Python, WASM, CLI, and MCP server bindings.

#text-extraction #open-source #pdf-parser

A simple OCR API server that's easy to deploy with Docker or on Heroku.

#text-extraction #api #api-server

Stars: 767

Forks: 147

Last commit: 5 years ago

yomu · Ruby

A Ruby library for extracting text and metadata from various document formats using Apache Tika.

#text-extraction #content-analysis #apache-tika

Stars: 503

Forks: 125

Last commit: 3 years ago

Toxy · C#

A .NET framework for extracting and exporting text and data from a wide variety of document formats.

#text-extraction #fileformats #office-documents

Stars: 455

Forks: 114

Last commit: 1 month ago

Grim · Ruby

A Ruby gem for extracting pages from PDFs as images and text strings using Ghostscript, ImageMagick, and pdftotext.

#text-extraction #ghostscript #image-extraction

Stars: 231

Forks: 52

Last commit: 2 years ago

MonkeyLearn · R

An archived R package for accessing the MonkeyLearn API for text classification and extraction.

#text-extraction #peer reviewed #text-classification

Stars: 92

Forks: 16

Last commit: 4 years ago

gxpdf · Go

A high-performance Go library for PDF creation, reading, table extraction, digital signatures, and encryption.

#text-extraction #go-modules #open-source

Stars: 46

Forks: 6

Last commit: 2 months ago

Wagtail-Textract · Python

A Wagtail CMS extension that enables full-text search within uploaded documents (PDF, Word, Excel).

#search #text-extraction #pdf-search

Stars: 34

Forks: 14

Last commit: 2 years ago

vesseract · V

A V programming language wrapper for Tesseract-OCR, enabling text extraction and OCR operations from images.

#text-extraction #document-analysis #wrapper-library

Stars: 17

Forks: 3

Last commit: 4 years ago

tikalinkextract · HTML

Extracts hyperlinks from files using Apache Tika for batch processing and web archiving workflows.

#text-extraction #digitalpreservation #code4lib

Stars: 11

Forks: 0

Last commit: 1 year ago

BlazorServerImageRecognitionApp · C#

A Blazor Server app that uses Azure Computer Vision to extract printed text from uploaded images.

#text-extraction #net10 #web-app

Stars: 11

Forks: 3

Last commit: 4 months ago

docx_cr_converter · Crystal

A Crystal library that extracts text and formatting from .DOCX files and converts them to Markdown.

#text-extraction #library #docx

Stars: 11

Forks: 1

Last commit: 3 years ago

PyLng · Java

A Python parser for HandyGames .lng files, converting them to JSON for translation.

#text-extraction #game-localization #deepl

Stars: 4

Forks: 1

Last commit: 2 years ago

i18n-scanner-toolkit · TypeScript

A TypeScript-first, framework-agnostic toolkit for scanning, extracting, and managing i18n translations with CSV support.

#text-extraction #internationalization #csv-export

A WezTerm plugin that extracts text from command output and inserts it into the next prompt.

#text-extraction #lua-scripting #command-line-tools

Stars: 2

Forks: 0

Last commit: 10 months ago

abbi-ng-ai-image-descriptor · TypeScript

An Angular web app that generates alt text and transcribes text in images using AI models from OpenAI and Google.

#text-extraction #web-app #ai-vision

Stars: 1

Forks: 0

Last commit: 3 days ago
