Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.


Data Extraction

86 projects

Showing 36 of 86 projects

Crawly (Elixir)

A high-level web crawling and scraping framework for Elixir, designed for data extraction and processing.

#scraping-websites #elixir #web-crawling
Stars: 1.1k
Forks: 123
Last commit: 10 months ago
select.rs (Rust)

A Rust library for extracting structured data from HTML documents, designed for web scraping tasks.

#css-selectors #dom-traversal #html-parsing
Stars: 1.0k
Forks: 68
Last commit: 1 year ago
storm-crawler (Java)

A scalable, mature, and versatile web crawler built on Apache Storm for low-latency, distributed crawling systems.

#distributed #real-time-processing #distributed-systems
Stars: 979
Forks: 277
Last commit: 3 days ago
Crawler (Elixir)

A high-performance web crawler and scraper built in Elixir with worker pooling and rate limiting.

#elixir #spider #offline
Stars: 958
Forks: 90
Last commit: 2 years ago
CoreXLSX (Swift)

A pure Swift library for parsing and reading Excel XLSX spreadsheet files.

#excelreader #ios #codable
Stars: 905
Forks: 109
Last commit: 2 years ago
commonregex (Go)

A Go library providing pre-built regular expressions for common patterns like dates, emails, and phone numbers.

#developer-tools #regex #validation
Stars: 898
Forks: 71
Last commit: 6 years ago
bit7z (C++)

A C++ static library providing a clean, cross-platform interface to 7-Zip for archive compression and extraction.

#bzip2 #tar #static-library
Stars: 840
Forks: 143
Last commit: 1 day ago
Spidr (Ruby)

A versatile Ruby web spidering library for crawling sites, domains, or specific links with extensive filtering and callback support.

#web-crawling #spider #crawler
Stars: 836
Forks: 107
Last commit: 4 months ago
xidel (Pascal)

A command-line tool to extract data from HTML/XML pages and JSON APIs using CSS, XPath, XQuery, JSONiq, and pattern matching.

#rest #css-selectors #http
Stars: 836
Forks: 45
Last commit: 1 year ago
pdf_oxide (Rust)

A high-performance PDF toolkit for text/image extraction, markdown conversion, and PDF editing, built in Rust with Python, WASM, CLI, and MCP server bindings.

#text-extraction #open-source #pdf-parser
Stars: 804
Forks: 88
Last commit: 1 day ago
htmlquery (Go)

A Go package for querying HTML documents using XPath expressions with built-in caching for performance.

#caching #xpath-selector #html-parsing
Stars: 784
Forks: 80
Last commit: 15 days ago
xpath (Go)

A Go package for querying XML, HTML, and JSON documents using XPath expressions.

#xpath-query #selects-descendants #document-query
Stars: 740
Forks: 92
Last commit: 3 months ago
dataflowkit (Go)

A Go web scraping framework that extracts structured data from websites using CSS selectors, including JavaScript-rendered pages.

#chrome-fetcher #scraping-websites #javascript-rendering
Stars: 714
Forks: 84
Last commit: 3 years ago
jsongrep (Rust)

A command-line tool and Rust library for fast querying of JSON, YAML, TOML, and other documents using regular path expressions.

#search #developer-tools #yaml
Stars: 646
Forks: 11
Last commit: 1 month ago
url-pattern (CoffeeScript)

A JavaScript library for matching and generating strings with patterns that are easier to write than regular expressions, ideal for URL routing and data extraction.

#pattern-parsing #regex-alternative #string-matching
Stars: 588
Forks: 43
Last commit: 5 years ago
libxls (C)

A C library for reading binary Excel (XLS) files with a command-line tool for converting XLS to CSV.

#c-library #binary-file #xls
Stars: 530
Forks: 145
Last commit: 10 months ago
Awesome Website Change Monitoring

A comprehensive curated list of open-source and hosted tools for monitoring and detecting changes on websites.

#change-detection #diffing #awesome-list
Stars: 512
Forks: 37
Last commit: 7 months ago
xmlquery (Go)

A Go package for querying XML documents using XPath expressions with built-in caching for performance.

#go-library #streaming-parser #utf-16-support
Stars: 488
Forks: 94
Last commit: 2 months ago
DeepPluck (Ruby)

A Ruby gem that efficiently plucks attributes from nested ActiveRecord associations without loading full records.

#rubygems #rails-optimization #rails
Stars: 459
Forks: 14
Last commit: 1 year ago
Toxy (C#)

A .NET framework for extracting and exporting text and data from a wide variety of document formats.

#text-extraction #fileformats #office-documents
Stars: 451
Forks: 115
Last commit: 1 day ago
Nokolexbor (C)

A high-performance, Nokogiri-compatible HTML5 parser for Ruby with CSS selector and XPath support.

#dom-manipulation #css-selectors #html5
Stars: 410
Forks: 8
Last commit: 1 month ago
Lambda Soup (OCaml)

A functional HTML scraping and manipulation library for OCaml with CSS selector support.

#ocaml-library #functional-programming #css-selectors
Stars: 407
Forks: 35
Last commit: 1 year ago
Web Scraping Reference: Cheat Sheet for Web Scraping using R (R)

A comprehensive cheat sheet and reference for web scraping in R using rvest, httr, and RSelenium.

#r-programming #webscraping #httr
Stars: 397
Forks: 101
Last commit: 3 years ago
scrape (Elixir)

An Elixir library for structured data extraction from websites, articles, and RSS/Atom feeds using information-retrieval techniques.

#readability #elixir #information-retrieval
Stars: 337
Forks: 41
Last commit: 5 years ago
meeseeks (Elixir)

An Elixir library for parsing and extracting data from HTML and XML using CSS or XPath selectors.

#elixir #css-selectors #nif
Stars: 324
Forks: 26
Last commit: 1 year ago
JsonSurfer (Java)

A streaming JsonPath processor for Java that extracts JSON data without loading entire documents into memory.

#java-library #non-blocking #streaming-json
Stars: 316
Forks: 58
Last commit: 2 years ago
iso9660 (Go)

A Go library for reading and creating ISO9660 disk images with experimental Rock Ridge support.

#archive-creation #filesystem #iso9660
Stars: 285
Forks: 44
Last commit: 2 years ago
goq (Go)

A declarative struct-tag-based HTML unmarshaling and web scraping library for Go built on goquery.

#unmarshall #unmarshaling #css-selectors
Stars: 270
Forks: 21
Last commit: 4 years ago
antch (Go)

A fast, powerful, and extensible web crawling and scraping framework for Go, inspired by Scrapy.

#web-crawling #concurrent #crawler
Stars: 267
Forks: 40
Last commit: 6 years ago
xlsxir (Elixir)

An Elixir library for parsing .xlsx files using SAX parsing and storing data in ETS for efficient access.

#elixir-lang #elixir #libreoffice
Stars: 219
Forks: 104
Last commit: 3 months ago
gojq (Go)

A Go library for querying JSON data with a simple expression syntax, making JSON parsing and type assertion easier.

#json-query #go-package #golang-library
Stars: 190
Forks: 23
Last commit: 2 years ago
Go Get Crawl (Go)

A Go tool and library for downloading URLs and files from Common Crawl and Wayback Machine web archives.

#crawler #go-library #wayback-machine
Stars: 179
Forks: 17
Last commit: 1 year ago
ArchiveSpark (Scala)

An Apache Spark framework for efficient data processing, extraction, and derivation from web archives and archival collections.

#data-lineage #apache-spark #web-archives
Stars: 161
Forks: 19
Last commit: 8 months ago
Rattletrap (Haskell)

A Haskell command-line tool to parse Rocket League replays into JSON and generate replays from JSON.

#haskell #rocket-league #cli-tool
Stars: 158
Forks: 21
Last commit: 10 months ago