Data Extraction

109 projects

Showing 36 of 109 projects

tesseractC++

An open-source OCR engine that converts images to text, supporting over 100 languages and multiple output formats.

#c-plus-plus-library#hacktoberfest#open-source

Stars75.4k

Forks10.7k

Last commit2 days ago

crawl4aiPython

An open-source web crawler and scraper that converts web content into clean, LLM-ready Markdown for RAG, agents, and data pipelines.

#playwright#ai-agents#markdown-generation

Stars73.3k

Forks7.5k

Last commit6 days ago

doclingPython

A Python library for parsing diverse document formats into structured data, optimized for integration with generative AI applications.

#ai#tables#documents

Stars63.5k

Forks4.5k

Last commit4 days ago

xlsx

A JavaScript library for reading, writing, and processing spreadsheet data across Excel, CSV, and other formats.

#database#open-source#spreadsheet

Stars36.3k

Forks7.9k

Last commit2 years ago

GJSONGo

A Go package for fast and simple retrieval of values from JSON documents using path syntax.

#library#json-query#dot-notation

Stars15.5k

Forks906

Last commit2 months ago

BinwalkRust

A Rust-based firmware analysis tool for identifying and extracting embedded files and data from binary files.

#entropy-analysis#embedded-systems#security-tools

Stars14.1k

Forks1.8k

Last commit1 month ago

instructorPython

A Python library that provides reliable, validated JSON outputs from any LLM using Pydantic models.

#structured-output#pydantic#python-library

Stars13.6k

Forks1.2k

Last commit8 days ago

PhotonPython

A fast web crawler designed for OSINT (open-source intelligence) data extraction.

#information-gathering#spider#osint

Stars13.0k

Forks1.7k

Last commit5 months ago

webmagicJava

A scalable Java framework for building web crawlers, covering downloading, URL management, content extraction, and persistence.

#distributed-systems#crawler#html-parsing

Stars11.7k

Forks4.1k

Last commit7 months ago

JsonPathJava

A Java DSL for reading, querying, and manipulating JSON documents using XPath-like expressions.

#java-library#json-query#dsl

Stars9.4k

Forks1.7k

Last commit4 months ago

pdfminer.sixPython

A Python library for extracting and analyzing text, images, and metadata from PDF documents.

#text-extraction#pdf-tools#open-source

Stars7.0k

Forks1.0k

Last commit4 months ago

Node-CrawlerTypeScript

A Node.js web crawler with server-side jQuery, rate limiting, and proxy support for efficient scraping.

#proxy-support#jquery#spider

Stars6.8k

Forks866

Last commit1 month ago

Knwl.jsJavaScript

A JavaScript library for parsing text to extract dates, times, phone numbers, emails, places, and other structured information.

#plugin-system#information-retrieval#natural-language-processing

Stars5.3k

Forks211

Last commit2 years ago

SwiftSoupSwift

A pure Swift HTML parser with DOM, CSS, and jQuery-like methods for parsing, manipulating, and cleaning HTML across Apple platforms and Linux.

#dom-manipulation#parse#css-selectors

Stars5.1k

Forks397

Last commit14 days ago

Crawler4jJava

An open-source Java web crawler that provides a simple interface for multi-threaded web crawling.

#java-library#open-source#crawling-framework

Stars4.6k

Forks1.9k

Last commit4 years ago

DotnetSpiderC#

A lightweight, efficient, and fast high-level web crawling and scraping framework for .NET.

#web-crawling#distributed#redis

Stars4.1k

Forks1.1k

Last commit3 months ago

libarchiveC

A portable C library for reading and writing streaming archives in multiple formats, with command-line tools.

#stream-processing#c-library#gz

A Ruby library for reading and parsing spreadsheet files (Excel, OpenOffice, CSV) with a unified interface.

#libreoffice#file-processing#ruby-gem

Stars2.9k

Forks501

Last commit9 months ago

gofeedGo

A robust Go library for parsing RSS, Atom, and JSON feeds with support for extensions and invalid feed handling.

#rss-feed#atom-feed#rss

Stars2.9k

Forks219

Last commit8 days ago

InvoiceNetPython

A deep neural network for extracting structured information from invoice documents, with a customizable UI and training tools.

#document-intelligence#invoice-processing#deep-learning

Stars2.7k

Forks413

Last commit2 years ago

SharpCompressC#

A pure C# compression library for .NET that reads and writes multiple archive formats with forward-only streaming support.

#stream-processing#bzip2#tar

Stars2.6k

Forks516

Last commit5 days ago

scraperRust

A Rust library for parsing HTML and querying elements using CSS selectors.

#dom-manipulation#hacktoberfest#css-selectors

Stars2.4k

Forks126

Last commit15 days ago

DiDOMPHP

A simple and fast HTML and XML parser for PHP with CSS selector and XPath support.

#dom-manipulation#css-selectors#php-library

Stars2.2k

Forks200

Last commit5 months ago

TabulaJava

A Java library and command-line tool for extracting tables from PDF files.

#java-library#batch-processing#extraction-engine

Stars2.0k

Forks451

Last commit1 year ago

tinfoleakPython

An open-source OSINT tool that automates Twitter intelligence analysis by extracting and structuring user data, activity, and geolocation information.

#twitter-analysis#python-tool#socialmedia

Stars2.0k

Forks266

Last commit7 years ago

ruiaPython

An async Python web scraping micro-framework built on asyncio and aiohttp for fast, extensible crawling.

#python-3#asyncio#aiohttp

Stars1.7k

Forks186

Last commit3 years ago

UptonHTML

A batteries-included Ruby framework for web scraping, with a built-in debug mode and rate limiting.

#debugging-tools#nokogiri#crawler

Stars1.6k

Forks108

Last commit7 years ago

ZstegRuby

A command-line tool that detects steganographically hidden data in PNG and BMP image files.

#png-analysis#image-analysis#ctf-tools

Stars1.6k

Forks163

Last commit5 months ago

JSONSelectJavaScript

A CSS-like selector language for querying and filtering JSON documents.

#stream-filtering#css-selectors#language-agnostic

Stars1.6k

Forks112

Last commit4 years ago

rvestR

A tidyverse package for web scraping in R, inspired by Beautiful Soup and designed for data extraction workflows.

#r-package#r-language#html-parsing

A browser forensics tool for analyzing web artifacts from Google Chrome and other Chromium-based browsers.

#digital-forensics#chrome#browser-forensics

Stars1.5k

Forks182

Last commit4 days ago

WombatRuby

A lightweight Ruby web crawler and scraper with an elegant DSL for extracting structured data from web pages.

#dsl#crawler#ruby-gem

Stars1.4k

Forks128

Last commit3 months ago

PHP SpiderPHP

A configurable and extensible PHP web spider for crawling and scraping websites with support for breadth-first/depth-first traversal, caching, and custom filters.

#event-driven#caching#css-selectors

Stars1.3k

Forks231

Last commit22 days ago