A high-level web crawling and scraping framework for Elixir, designed for data extraction and processing.
A Rust library for extracting structured data from HTML documents, designed for web scraping tasks.
A scalable, mature, and versatile web crawler built on Apache Storm for low-latency, distributed crawling systems.
A high-performance web crawler and scraper built in Elixir with worker pooling and rate limiting.
A pure Swift library for parsing and reading Excel XLSX spreadsheet files.
A Go library providing pre-built regular expressions for common patterns like dates, emails, and phone numbers.
A C++ static library providing a clean, cross-platform interface to 7-Zip for archive compression and extraction.
A versatile Ruby web spidering library for crawling sites, domains, or specific links with extensive filtering and callback support.
A command-line tool to extract data from HTML/XML pages and JSON APIs using CSS, XPath, XQuery, JSONiq, and pattern matching.
A high-performance PDF toolkit for text/image extraction, markdown conversion, and PDF editing, built in Rust with Python, WASM, CLI, and MCP server bindings.
A Go package for querying HTML documents using XPath expressions with built-in caching for performance.
A Go package for querying XML, HTML, and JSON documents using XPath expressions.
A Go web scraping framework that extracts structured data from websites using CSS selectors, including JavaScript-rendered pages.
A command-line tool and Rust library for fast querying of JSON, YAML, TOML, and other documents using regular path expressions.
A JavaScript library for matching and generating strings using patterns that are simpler than regular expressions, ideal for URL routing and data extraction.
A C library for reading binary Excel (XLS) files with a command-line tool for converting XLS to CSV.
A comprehensive curated list of open-source and hosted tools for monitoring and detecting changes on websites.
A Go package for querying XML documents using XPath expressions with built-in caching for performance.
A Ruby gem that efficiently plucks attributes from nested ActiveRecord associations without loading full records.
A .NET framework for extracting and exporting text and data from a wide variety of document formats.
A high-performance, Nokogiri-compatible HTML5 parser for Ruby with CSS selector and XPath support.
A functional HTML scraping and manipulation library for OCaml with CSS selector support.
A comprehensive cheat sheet and reference for web scraping in R using rvest, httr, and RSelenium.
An Elixir library for structured data extraction from websites, articles, and RSS/Atom feeds using information-retrieval techniques.
An Elixir library for parsing and extracting data from HTML and XML using CSS or XPath selectors.
A streaming JsonPath processor for Java that extracts JSON data without loading entire documents into memory.
A Go library for reading and creating ISO9660 disk images with experimental Rock Ridge support.
A declarative struct-tag-based HTML unmarshaling and web scraping library for Go built on goquery.
A fast, powerful, and extensible web crawling and scraping framework for Go, inspired by Scrapy.
An Elixir library for parsing .xlsx files using SAX parsing and storing data in ETS for efficient access.
A Go library for querying JSON data with a simple expression syntax, making JSON parsing and type assertion easier.
A Go tool and library for downloading URLs and files from Common Crawl and Wayback Machine web archives.
An Apache Spark framework for efficient data processing, extraction, and derivation from web archives and archival collections.
A Haskell command-line tool to parse Rocket League replays into JSON and generate replays from JSON.