Web Scraping

Spidr (Ruby)

A versatile Ruby web spidering library for crawling sites, domains, or specific links with extensive filtering and callback support.

#web-crawling #spider #crawler

Stars: 835

Forks: 107

Last commit: 6 months ago

ReadabilityKit (Swift)

A Swift library for extracting article previews including title, description, images, and metadata from web pages.

#content-parsing #ios #metadata-extraction

Stars: 835

Forks: 79

jvppeteer (Java)

A Java API for controlling Chrome and Firefox browsers via the DevTools and WebDriver BiDi protocols.

#chrome #puppeteer #screenshot

Stars: 808

Forks: 170

Last commit: 12 days ago

suckit (Rust)

A Rust-based command-line tool for recursively downloading entire websites for offline browsing.

#hacktoberfest #archiving #webscraping

Stars: 806

Forks: 44

Last commit: 4 months ago

cdp (Go)

Type-safe Go bindings for the Chrome DevTools Protocol, enabling browser automation and debugging.

#rpc-client #cdp #devtools-protocol

Stars: 794

Forks: 50

Last commit: 7 months ago

htmlquery (Go)

A Go package for querying HTML documents using XPath expressions with built-in caching for performance.

#caching #xpath-selector #html-parsing

Stars: 784

Forks: 80

Last commit: 22 days ago

image-scraper (Python)

A high-performance, multithreaded command-line tool for downloading images from webpages.

#pypi #commandline-tool #terminal

Stars: 776

Forks: 104

Last commit: 8 years ago

Essence (PHP)

A PHP library for extracting media information from web pages like YouTube videos, Twitter statuses, and blog articles.

#metadata-parsing #url-crawling #php-library

Stars: 772

Forks: 80

xpath (Go)

A Go package for querying XML, HTML, and JSON documents using XPath expressions.

#xpath-query #selects-descendants #document-query

Stars: 743

Forks: 98

Last commit: 1 day ago

fattest-cat (JavaScript)

A script to find the fattest cat currently available for adoption at the San Francisco SPCA.

#fun-project #san-francisco #pet-adoption

Stars: 736

Forks: 38

dataflowkit (Go)

A Go web scraping framework that extracts structured data from websites using CSS selectors, including JavaScript-rendered pages.

#chrome-fetcher #scraping-websites #javascript-rendering

Stars: 715

Forks: 83

Hickory (Clojure)

A Clojure/ClojureScript library that parses HTML into Clojure data structures for analysis, transformation, and serialization.

#dom-manipulation #hiccup #clojurescript

Stars: 678

Forks: 55

Last commit: 3 months ago

pychrome (Python)

A Python package for controlling Google Chrome/Chromium via the Chrome DevTools Protocol with a threading-based API.

#threading #chrome #headless-chrome

Stars: 647

Forks: 116

tor-browser-selenium (Python)

A Python library for automating Tor Browser with Selenium WebDriver for privacy-focused web scraping and testing.

#selenium #privacy #onion-services

Stars: 594

Forks: 104

Steam Community (JavaScript)

A Node.js library for interacting with Steam Community's website interfaces, including login, trading, and inventory management.

#trading-bot #steam #steam-community

Stars: 575

Forks: 156

Last commit: 11 days ago

LinkThumbnailer (Ruby)

A Ruby gem that fetches images and metadata from URLs to generate link previews, similar to those shown on social media.

#content-parsing #thumbnail-generation #metadata-extraction

Stars: 510

Forks: 105

playwright-ruby-client (Ruby)

A Ruby client library for browser automation and testing using Microsoft Playwright.

#playwright #ruby-gem #headless-browser

Stars: 499

Forks: 52

Last commit: 3 days ago

HLTV (TypeScript)

An unofficial Node.js API for programmatically accessing HLTV's Counter-Strike esports data, including matches, teams, players, and live scores.

#statistics #live-scores #scraper

Stars: 493

Forks: 126

azuretls-client (Go)

A Go HTTP client that spoofs TLS/JA3, HTTP/2, and HTTP/3 fingerprints to emulate real browsers by default.

#proxy-support #ja3-fingerprint #browser-emulation

Stars: 464

Forks: 65

Last commit: 3 months ago

deno-puppeteer (TypeScript)

A port of the Puppeteer browser automation library to run natively on Deno.

#puppeteer #screenshot #headless-chrome

Stars: 458

Forks: 47

Android-Link-Preview (Java)

An Android library that generates link previews by extracting titles, descriptions, and images from URLs.

#content-preview #metadata-extraction #link-preview

Stars: 414

Forks: 130

Last commit: 6 years ago

Nokolexbor (C)

A high-performance, Nokogiri-compatible HTML5 parser for Ruby with CSS selector and XPath support.

#dom-manipulation #css-selectors #html5

Stars: 414

Forks: 8

Last commit: 22 days ago

Lambda Soup (OCaml)

A functional HTML scraping and manipulation library for OCaml with CSS selector support.

#ocaml-library #functional-programming #css-selectors

Stars: 409

Forks: 35

godet (Go)

A Go client library for remotely controlling Chrome/Chromium browsers via the Chrome DevTools Protocol.

#go-client #go-library #headless-browser

Stars: 398

Forks: 43

Last commit: 4 months ago

Web Scraping Reference: Cheat Sheet for Web Scraping using R (R)

A comprehensive cheat sheet and reference for web scraping in R using rvest, httr, and RSelenium.

#r-programming #webscraping #httr

A versatile Rust tool for generating and mutating wordlists using patterns, web scraping, and password formats.

#cracking #hash #infosec

Stars: 390

Forks: 22

Last commit: 5 months ago

hget (HTML)

A CLI and API tool that converts HTML into plain text, Markdown, or filtered HTML for terminal viewing.

#developer-tools #terminal-utility #content-extraction

Stars: 388

Forks: 13

rookie (Rust)

A cross-platform library to load and decrypt cookies from any web browser, built with Rust for speed and safety.

#cookies #privacy-tools #cookie-extraction

Stars: 364

Forks: 48

Last commit: 6 months ago

read-art (JavaScript)

A Node.js library to automatically scrape and extract readable article content from any web page, supporting both English and Chinese.

#readability #content-extraction #crawler

Stars: 346

Forks: 36

Last commit: 8 years ago

crawley (Go)

A fast, Unix-style command-line web crawler that extracts links, resources, and API endpoints from web pages.

#api-discovery #resource-discovery #link-extraction

Stars: 340

Forks: 18

Last commit: 7 days ago

scrape (Elixir)

An Elixir library for structured data extraction from websites, articles, and RSS/Atom feeds using information-retrieval techniques.

#readability #elixir #information-retrieval

Stars: 337

Forks: 41

Last commit: 6 years ago

meeseeks (Elixir)

An Elixir library for parsing and extracting data from HTML and XML using CSS or XPath selectors.

#elixir #css-selectors #nif

Stars: 325

Forks: 26

wikipedia (Ruby)

A Ruby client library for interacting with the Wikipedia API, providing easy access to articles, summaries, images, and metadata.

#content-parsing #data-fetching #ruby-gem

Stars: 309

Forks: 74