Showing 31 of 31 projects
An open-source OCR engine that converts images to text, supporting over 100 languages and multiple output formats.
An open-source web crawler and scraper that converts web content into clean, LLM-ready Markdown for RAG, agents, and data pipelines.
A Python library for parsing diverse document formats into structured data, optimized for integration with generative AI applications.
A JavaScript library for reading, writing, and processing spreadsheet data across Excel, CSV, and other formats.
A Go package for fast and simple retrieval of values from JSON documents using path syntax.
A Rust-based firmware analysis tool for identifying and extracting embedded files and data.
A Rust-based firmware analysis tool for identifying and extracting embedded files and data from binary files.
A fast web crawler designed for OSINT (Open Source Intelligence) data extraction.
A Python library that provides reliable, validated JSON outputs from any LLM using Pydantic models.
A scalable Java framework for building web crawlers, covering downloading, URL management, content extraction, and persistence.
A Java DSL for reading, querying, and manipulating JSON documents using XPath-like expressions.
A Python library for extracting and analyzing text, images, and metadata from PDF documents.
A Node.js web crawler with server-side jQuery, rate limiting, and proxy support for efficient scraping.
A JavaScript library for parsing text to extract dates, times, phone numbers, emails, places, and other structured information.
A pure Swift HTML parser with DOM, CSS, and jQuery-like methods for parsing, manipulating, and cleaning HTML across Apple platforms and Linux.
An open-source Java web crawler that provides a simple interface for multi-threaded web crawling.
A lightweight, fast, high-level web crawling and scraping framework for .NET.
A portable C library for reading and writing streaming archives in multiple formats, with command-line tools.
A Ruby library for reading and parsing spreadsheet files (Excel, OpenOffice, CSV) with a unified interface.
A robust Go library for parsing RSS, Atom, and JSON feeds, with support for extensions and tolerant handling of invalid feeds.
A deep neural network for extracting structured information from invoice documents, with a customizable UI and training tools.
A pure C# compression library for .NET that reads and writes multiple archive formats with forward-only streaming support.
A Rust library for parsing HTML and querying elements using CSS selectors.
A simple and fast HTML and XML parser for PHP with CSS selector and XPath support.
A Java library and command-line tool for extracting tables from PDF files.
An open-source OSINT tool that automates Twitter intelligence analysis by extracting and structuring user data, activity, and geolocation information.
An async Python web scraping micro-framework built on asyncio and aiohttp for fast, extensible crawling.
A batteries-included Ruby framework for web scraping, with a built-in debug mode and rate limiting.
A CSS-like selector language for querying and filtering JSON documents.
A command-line tool that detects steganographically hidden data in PNG and BMP image files.
A tidyverse package for web scraping in R, inspired by Beautiful Soup and designed for data extraction workflows.