Data Extraction

#distributed #real-time-processing #distributed-systems

storm-crawler (Java)

A scalable, mature, and versatile web crawler built on Apache Storm for building low-latency, distributed crawling systems.

Stars: 986

Forks: 285

Last commit: 19 hours ago

Crawler (Elixir)

A high-performance web crawler and scraper built in Elixir with worker pooling and rate limiting.

#elixir #spider #offline

Stars: 957

Forks: 89

Last commit: 1 month ago

CoreXLSX (Swift)

A pure Swift library for parsing and reading Excel XLSX spreadsheet files.

#excelreader #ios #codable

Stars: 906

Forks: 111

#text-extraction #open-source #pdf-parser

pdf_oxide (Rust)

A high-performance PDF toolkit for text/image extraction, markdown conversion, and PDF editing, built in Rust with Python, WASM, CLI, and MCP server bindings.

Stars: 901

Forks: 110

Last commit: 11 hours ago

commonregex (Go)

A Go library providing pre-built regular expressions for common patterns like dates, emails, and phone numbers.

#developer-tools #regex #validation

Stars: 898

Forks: 72
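As a rough illustration of the pre-built-pattern idea (this is not commonregex's API, which is a Go package), here is a minimal Python sketch using only the standard `re` module. The `PATTERNS` table, the `extract` helper, and the deliberately simplified expressions are assumptions made for this example; production-grade email and date matching needs broader patterns.

```python
import re

# Simplified, illustrative patterns; not the expressions shipped by any library.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract(kind: str, text: str) -> list[str]:
    """Return every non-overlapping match of the named pattern in text."""
    return PATTERNS[kind].findall(text)


sample = "Contact ops@example.com before 2024-03-15 or dev@example.org."
emails = extract("email", sample)
dates = extract("iso_date", sample)
print(emails)
print(dates)
```

Compiling the patterns once and looking them up by name keeps call sites free of inline regex literals, which is the main convenience such libraries offer.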

#bzip2 #tar #static-library

bit7z (C++)

A C++ static library providing a clean, cross-platform interface to 7-Zip for archive compression and extraction.

Stars: 847

Forks: 146

Last commit: 4 days ago

xidel (Pascal)

A command-line tool to extract data from HTML/XML pages and JSON APIs using CSS, XPath, XQuery, JSONiq, and pattern matching.

#rest #css-selectors #http

Stars: 840

Forks: 46

#web-crawling #spider #crawler

Spidr (Ruby)

A versatile Ruby web spidering library for crawling sites, domains, or specific links with extensive filtering and callback support.

Stars: 835

Forks: 107

Last commit: 6 months ago

htmlquery (Go)

A Go package for querying HTML documents using XPath expressions with built-in caching for performance.

#caching #xpath-selector #html-parsing

Stars: 784

Forks: 80

Last commit: 21 days ago
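For readers unfamiliar with XPath-based extraction, the sketch below shows the general technique in Python using the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset. It does not use or imitate htmlquery's API. The sample document and variable names are assumptions for illustration, and the input is well-formed XHTML because the stdlib parser does not tolerate malformed HTML.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML so the stdlib XML parser can read it; real-world HTML
# usually needs a tolerant HTML parser instead.
DOC = """
<html>
  <body>
    <ul id="links">
      <li><a href="/a">First</a></li>
      <li><a href="/b">Second</a></li>
    </ul>
    <a href="/outside">Outside</a>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# ElementTree's XPath subset covers descendant search (.//), attribute
# predicates ([@id='links']), and child steps, which is enough here.
anchors = root.findall(".//ul[@id='links']/li/a")
hrefs = [a.get("href") for a in anchors]
texts = [a.text for a in anchors]
print(hrefs)
print(texts)
```

The expression selects only anchors inside the `links` list, so the link outside the list is not returned.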

xpath (Go)

A Go package for querying XML, HTML, and JSON documents using XPath expressions.

#xpath-query #selects-descendants #document-query

A Go web scraping framework that extracts structured data from websites using CSS selectors, including JavaScript-rendered pages.

#chrome-fetcher #scraping-websites #javascript-rendering

Stars: 715

Forks: 83

Last commit: 3 years ago

jsongrep (Rust)

A command-line tool and Rust library for fast querying of JSON, YAML, TOML, and other documents using regular path expressions.

#search #developer-tools #yaml

Stars: 656

Forks: 12

Last commit: 4 days ago

url-pattern (CoffeeScript)

A JavaScript library for matching and generating strings using patterns easier than regex, ideal for URL routing and data extraction.

#pattern-parsing #regex-alternative #string-matching

Stars: 587

Forks: 43
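The idea of matching and generating strings from a readable pattern can be sketched in a few lines of Python. This is not url-pattern's API or syntax; the `:name` segment convention and the `compile_pattern`, `match_path`, and `generate` helpers are hypothetical names chosen for this example.

```python
import re

# A named segment looks like ':id'. This convention is an assumption of the sketch.
SEGMENT = re.compile(r":([A-Za-z_][A-Za-z0-9_]*)")


def compile_pattern(pattern: str) -> re.Pattern:
    """Turn '/users/:id' into a regex with one named group per ':name' segment."""
    parts = []
    pos = 0
    for m in SEGMENT.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal text
        parts.append(f"(?P<{m.group(1)}>[^/]+)")         # one path segment
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("^" + "".join(parts) + "$")


def match_path(pattern: str, path: str):
    """Return the captured segments as a dict, or None if the path does not match."""
    m = compile_pattern(pattern).match(path)
    return m.groupdict() if m else None


def generate(pattern: str, params: dict) -> str:
    """Fill each ':name' segment with the corresponding value from params."""
    return SEGMENT.sub(lambda m: str(params[m.group(1)]), pattern)


params = match_path("/users/:id/posts/:slug", "/users/42/posts/hello")
url = generate("/users/:id", {"id": 7})
print(params)
print(url)
```

Using one pattern string for both matching and generation keeps routing and link building in sync, which is the appeal of this style over hand-written regexes.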

#c-library #binary-file #xls

libxls (C)

A C library for reading binary Excel (XLS) files with a command-line tool for converting XLS to CSV.

Stars: 532

Forks: 146

Last commit: 11 months ago

Awesome Website Change Monitoring

A comprehensive curated list of open-source and hosted tools for monitoring and detecting changes on websites.

#change-detection #diffing #awesome-list

Stars: 514

Forks: 41

Last commit: 9 months ago

xmlquery (Go)

A Go package for querying XML documents using XPath expressions with built-in caching for performance.

#go-library #streaming-parser #utf-16-support

Stars: 490

Forks: 94

Last commit: 21 days ago

DeepPluck (Ruby)

A Ruby gem that efficiently plucks attributes from nested ActiveRecord associations without loading full records.

#rubygems #rails-optimization #rails

Stars: 459

Forks: 14

#text-extraction #fileformats #office-documents

Toxy (C#)

A .NET framework for extracting and exporting text and data from a wide variety of document formats.

Stars: 456

Forks: 114

Last commit: 1 month ago

Nokolexbor (C)

A high-performance, Nokogiri-compatible HTML5 parser for Ruby with CSS selector and XPath support.

#dom-manipulation #css-selectors #html5

Stars: 414

Forks: 8

Last commit: 22 days ago

Lambda Soup (OCaml)

A functional HTML scraping and manipulation library for OCaml with CSS selector support.

#ocaml-library #functional-programming #css-selectors

Stars: 409

Forks: 35

Web Scraping Reference: Cheat Sheet for Web Scraping using R (R)

A comprehensive cheat sheet and reference for web scraping in R using rvest, httr, and RSelenium.

#r-programming #webscraping #httr

An Elixir library for structured data extraction from websites, articles, and RSS/Atom feeds using information-retrieval techniques.

#readability #elixir #information-retrieval

Stars: 337

Forks: 41

#elixir #css-selectors #nif

meeseeks (Elixir)

An Elixir library for parsing and extracting data from HTML and XML using CSS or XPath selectors.

Stars: 325

Forks: 26

#java-library #non-blocking #streaming-json

JsonSurfer (Java)

A streaming JsonPath processor for Java that extracts JSON data without loading entire documents into memory.

Stars: 317

Forks: 58

#archive-creation #filesystem #iso9660

iso9660 (Go)

A Go library for reading and creating ISO9660 disk images with experimental Rock Ridge support.

Stars: 285

Forks: 45

#unmarshall #unmarshaling #css-selectors

goq (Go)

A declarative struct-tag-based HTML unmarshaling and web scraping library for Go built on goquery.

Stars: 270

Forks: 21

Last commit: 4 years ago

antch (Go)

A fast, powerful, and extensible web crawling and scraping framework for Go, inspired by Scrapy.

#web-crawling #concurrent #crawler

Stars: 266

Forks: 40

#elixir-lang #elixir #libreoffice

xlsxir (Elixir)

An Elixir library for parsing .xlsx files using SAX parsing and storing data in ETS for efficient access.

Stars: 219

Forks: 106

Last commit: 5 months ago

gojq (Go)

A Go library for querying JSON data with a simple expression syntax, making JSON parsing and type assertion easier.

#json-query #go-package #golang-library

Stars: 191

Forks: 23

Last commit: 3 years ago

Go Get Crawl (Go)

A Go tool and library for downloading URLs and files from Common Crawl and Wayback Machine web archives.

#crawler #go-library #wayback-machine

Stars: 183

Forks: 17