Document Processing

43 projects


A collection of example skills for Claude that demonstrate how to create reusable instruction sets for specialized AI tasks.

#instruction-sets #agent-skills #workflow-automation

An opinionated RAG framework for integrating generative AI into applications, supporting any LLM, vector store, and file type.

#ai #database #rag-framework

Stars: 39.3k

Forks: 3.7k

Last commit: 1 year ago

OCRmyPDF (Python)

A command-line tool that adds an OCR text layer to scanned PDF files, making them searchable and copy-pasteable.

#text-extraction #pdf-ocr #pdf-a

A pure-Python PDF library for splitting, merging, cropping, transforming, and extracting data from PDF files.

#text-extraction #pdf-merging #pdf-documents

Stars: 10.1k

Forks: 1.6k

Last commit: 20 hours ago

gumbo-parser (HTML)

A pure-C HTML5 parsing library implementing the HTML5 parsing algorithm.

#c-library #html5 #portable

Stars: 5.2k

Forks: 665

Last commit: 6 months ago

myGPTReader (Python)

A Slack bot that reads and summarizes webpages, documents, and videos using ChatGPT, with voice chat capabilities.

#ai #content-summarization #embedding

Stars: 4.4k

Forks: 441

Last commit: 5 months ago

sumy (Python)

A Python library and CLI tool for automatic text summarization using extractive methods like LexRank, LSA, Luhn, and Edmundson.

#text-extraction #extractive-summarization #summarizer

Stars: 3.7k

Forks: 542

Last commit: 4 days ago

OP Vault ChatGPT (JavaScript)

Give ChatGPT long-term memory by uploading custom knowledge base files (PDF, txt, epub) and asking questions via a React frontend.

#openai #question-answering #vector-database

Stars: 3.4k

Forks: 297

Last commit: 1 year ago

gosseract (Go)

A Go package for Optical Character Recognition (OCR) using the Tesseract C++ library.

#text-extraction #tesseract-ocr #go-library

Stars: 3.1k

Forks: 307

Last commit: 6 months ago

Awesome OCR

A curated list of awesome open-source OCR software, libraries, datasets, and literature.

#text-extraction #open-source #historical-documents

Stars: 3.1k

Forks: 376

Last commit: 2 years ago

pikepdf (Python)

A Python library for reading, writing, repairing, and transforming PDFs, powered by the qpdf C++ library.

#pikepdf #linearization #pdf-a

Stars: 2.8k

Forks: 225

Last commit: 8 days ago

lopdf (Rust)

A Rust library for creating, merging, modifying, and decrypting PDF documents with support for modern object streams.

#object-streams #pdf-manipulation #pdf-library

Stars: 2.2k

Forks: 276

Last commit: 19 hours ago

IText (C#)

A high-performance .NET library for creating, manipulating, inspecting, and maintaining PDF documents.

#digital-signature #pdf-a #library

A Java JNA wrapper for the Tesseract OCR API, enabling OCR functionality in Java applications.

#text-extraction #pdf-ocr #java

Stars: 1.8k

Forks: 381

Last commit: 1 month ago

HexaPDF (Ruby)

A pure Ruby library for creating, manipulating, merging, securing, and optimizing PDF files with a Ruby-esque API.

#open-source #pdf-manipulation #command-line-tool

Stars: 1.4k

Forks: 81

Last commit: 1 month ago

PHPPowerPoint (PHP)

A pure PHP library for reading and writing presentation files in PowerPoint (PPTX) and OpenDocument (ODP) formats.

#hacktoberfest #office #server-side

Stars: 1.4k

Forks: 543

Last commit: 21 days ago

printpdf (Rust)

A Rust library for creating, reading, writing, and rendering PDF documents with support for graphics, fonts, and experimental HTML layout.

#pdf-reader #html-to-pdf #graphics

Stars: 1.1k

Forks: 135

Last commit: 17 hours ago

PDF-Writer (C)

A high-performance C++ library for creating, parsing, and manipulating PDF files and streams.

#open-source #pdf-manipulation #cmake

Stars: 1.0k

Forks: 230

Last commit: 1 month ago

pdf_oxide (Rust)

A high-performance PDF toolkit for text/image extraction, markdown conversion, and PDF editing, built in Rust with Python, WASM, CLI, and MCP server bindings.

#text-extraction #open-source #pdf-parser

Stars: 901

Forks: 110

Last commit: 8 hours ago

CombinePDF (Ruby)

A pure Ruby library for merging PDF files, adding page numbers, watermarks, and stamps.

#pdf-merging #page-numbering #pdf-merge

Stars: 784

Forks: 179

Last commit: 1 year ago

tesseract-ocr (Ruby)

A Ruby wrapper library that provides Ruby bindings and a Ruby-esque interface to the Tesseract OCR API.

#ruby-wrapper #ffi #tesseract-ocr

Stars: 636

Forks: 71

Last commit: 9 years ago

sejda (Java)

A task-oriented Java SDK for PDF manipulation with ready-to-use operations and an extensible architecture.

#pdf-editing #java-library #open-source

Stars: 545

Forks: 69

Last commit: 3 days ago

ShapeCrawler (C#)

A .NET library for reading, modifying, and generating PowerPoint (PPTX) presentations without requiring Microsoft Office.

#office-open-xml #presentation-automation #csharp

Stars: 434

Forks: 88

Last commit: 5 days ago

Text Analysis (Julia)

A Julia package providing standard tools and models for text analysis and natural language processing.

#nlp-library #julia #text-classification

Stars: 384

Forks: 92

Last commit: 3 months ago

knowledge-gpt (Python)

Extract and index knowledge from websites, PDFs, docs, and YouTube to power Q&A sessions using GPT and other language models.

#youtube-transcription #semantic-search #knowledge-extraction

A generic EPUB parser and generator library for Ruby that supports EPUB 2 and EPUB 3 specifications.

#ebook-parsing #epub3 #digital-publishing

Stars: 255

Forks: 44

Last commit: 9 days ago

PSWritePDF (C#)

A PowerShell module for creating, editing, splitting, merging, and converting PDF files across Windows, Linux, and macOS.

#hacktoberfest #itext7 #create

Stars: 232

Forks: 23

Last commit: 1 month ago

RGhost (Ruby)

A Ruby API for document creation and conversion using Ghostscript, supporting the PDF, PS, GIF, TIF, PNG, and JPG formats.

#postscript #api #ghostscript

Stars: 186

Forks: 48

Last commit: 2 years ago

NPOI (C#)

A .NET library for reading and writing Office formats (Excel, Word) without requiring a Microsoft Office installation.

#apache-poi #spreadsheet #file-format

Stars: 160

Forks: 19

Last commit: 2 months ago

docker-texlive (Dockerfile)

A Docker image providing a full TeX Live distribution with additional tools like Pandoc, Inkscape, and GraphViz for LaTeX workflows.

#academic-writing #inkscape #texlive

Stars: 116

Forks: 30

Last commit: 25 days ago

bindPDF (Swift)

A friendly macOS desktop app to combine multiple PDF files into a single PDF with a simple drag-and-drop interface.

#pdf-utilities #desktop-application #combine-pdf

Stars: 112

Forks: 9

Last commit: 6 years ago

pdf2htmlex (Elixir)

An Elixir library that converts PDF documents to HTML while preserving text and formatting.

#pdf-conversion #elixir #format-retention

Stars: 92

Forks: 18

Last commit: 10 years ago

opc (Go)

A Go implementation of the Open Packaging Conventions (OPC) for reading and writing formats like .docx and .xlsx.

#files #go-library #file-format

Stars: 80

Forks: 7

Last commit: 2 years ago

IntellyWeave (Python)

An AI-powered OSINT platform that extracts entities, visualizes relationships, and uses multi-agent reasoning to analyze documents for intelligence.

#fastapi #intelligence-analysis #osint

Stars: 71

Forks: 9

Last commit: 6 months ago

gxpdf (Go)

A high-performance Go library for PDF creation, reading, table extraction, digital signatures, and encryption.

#text-extraction #go-modules #open-source

A C# wrapper for the QPDF library enabling PDF manipulation, optimization, and transformation with cross-platform support.

#pdf-lib #nuget-package #pdf-manipulation

Stars: 25

Forks: 5

Last commit: 2 years ago
