Web Scraping

190 projects

PuppeteerSharpC#

A .NET port of the official Node.js Puppeteer API for headless browser automation.

#chrome#puppeteer#screenshot

Stars3.9k

Forks486

Last commit2 days ago

html-to-markdownGo

A robust HTML to Markdown converter with plugin support, usable as a Go library, CLI tool, or via hosted API.

#developer-tools#plugin-system#commonmark

Stars3.8k

Forks216

Last commit11 days ago

ScrapePython

A Python module to bypass Cloudflare's anti-bot page by solving JavaScript challenges using Node.js.

#scraping-websites#anti-bot#cloudflare-bypass

Stars3.5k

Forks452

Last commit2 years ago

playwright-goGo

A Go library to automate Chromium, Firefox, and WebKit browsers with a single API for cross-browser web automation.

#playwright#hacktoberfest#headless-chrome

Stars3.4k

Forks237

Last commit7 days ago

Symfony PantherPHP

A PHP and Symfony library for browser testing and web scraping using real browsers via the WebDriver protocol.

#hacktoberfest#selenium-webdriver#chromedriver

Stars3.1k

Forks232

Last commit1 month ago

slimerjsJavaScript

A scriptable browser based on Firefox's Gecko engine, compatible with PhantomJS API for web automation and testing.

#javascript-testing#slimerjs#gecko-engine

Stars3.0k

Forks255

Last commit3 years ago

playwright-dotnetC#

Official .NET library for cross-browser web automation and testing with Chromium, Firefox, and WebKit.

#playwright#chrome#csharp

Stars3.0k

Forks303

Last commit29 days ago

cnn_captchaPython

A TensorFlow-based CNN solution for recognizing character-based CAPTCHAs, providing training, validation, and API modules.

#flask#captcha-recognition#python

Stars2.9k

Forks785

Last commit3 years ago

gofeedGo

A robust Go library for parsing RSS, Atom, and JSON feeds with support for extensions and invalid feed handling.

#rss-feed#atom-feed#rss

Stars2.9k

Forks219

Last commit4 days ago

Chrome PHPPHP

A PHP library to control headless Chrome/Chromium instances for browser automation, screenshots, and PDF generation.

#dom-manipulation#hacktoberfest#screenshot

Stars2.7k

Forks321

Last commit18 days ago

scraperRust

A Rust library for parsing HTML and querying elements using CSS selectors.

#dom-manipulation#hacktoberfest#css-selectors

Stars2.4k

Forks127

Last commit4 days ago

DiDOMPHP

A simple and fast HTML and XML parser for PHP with CSS selector and XPath support.

#dom-manipulation#css-selectors#php-library

Stars2.2k

Forks200

Last commit5 months ago

html2textPython

A Python library and CLI tool that converts HTML into clean, readable Markdown-formatted plain text.

#python-library#markdown-parser#plain-text

Stars2.2k

Forks297

Last commit8 months ago

flokiElixir

A simple HTML parser for Elixir that enables search for nodes using CSS selectors.

#css-selector#elixir#css-selectors

Stars2.1k

Forks163

Last commit1 month ago

EmbedPHP

A PHP library to extract metadata, embed codes, and structured data from any web page using multiple protocols.

#embeds#social-media#metadata-extraction

Stars2.1k

Forks324

Last commit16 days ago

FerrumRuby

A high-level Ruby API for controlling Chrome/Chromium browsers directly via the Chrome DevTools Protocol.

#web-crawling#developer-tools#chrome

Stars2.0k

Forks168

Last commit18 days ago

webclawRust

A fast, local-first web scraper and content extractor optimized for AI agents, with CLI, REST API, and MCP server.

#content-extraction#crawler#cli-tool

Advanced Go HTTP client with browser impersonation, TLS fingerprinting, HTTP/3 support, and anti-bot bypass for web automation.

#browser-impersonation#anti-bot#http3

Stars1.8k

Forks93

Last commit20 days ago

ruiaPython

An async Python web scraping micro-framework built on asyncio and aiohttp for fast, extensible crawling.

#python-3#asyncio#aiohttp

Stars1.7k

Forks186

Last commit3 years ago

UptonHTML

A batteries-included Ruby framework for easy web-scraping with built-in debug mode and rate limiting.

#debugging-tools#nokogiri#crawler

Stars1.6k

Forks108

Last commit7 years ago

rvestR

A tidyverse package for web scraping in R, inspired by Beautiful Soup and designed for data extraction workflows.

#r-package#r-language#html-parsing

A lightweight Ruby web crawler and scraper with an elegant DSL for extracting structured data from web pages.

#dsl#crawler#ruby-gem

Stars1.4k

Forks128

Last commit3 months ago

PHP SpiderPHP

A configurable and extensible PHP web spider for crawling and scraping websites with support for breadth-first/depth-first traversal, caching, and custom filters.

#event-driven#caching#css-selectors

Stars1.3k

Forks231

Last commit25 days ago

PuPHPeteerPHP

A PHP bridge to Puppeteer that provides full API support for browser automation from PHP applications.

#developer-tools#puppeteer#headless-chrome

Stars1.3k

Forks209

Last commit3 years ago

rotating-proxyRuby

A Docker container that provides a rotating proxy service using multiple Tor circuits for IP rotation.

#http-proxy#polipo#haproxy

Stars1.2k

Forks253

Last commit2 years ago

WKZombieSwift

A Swift headless browser framework for iOS/OSX to automate website navigation, data collection, and testing without a UI.

#functional-programming#ios#osx

Stars1.2k

Forks100

Last commit5 years ago

justhtmlPython

A pure Python HTML5 parser with spec-perfect parsing, built-in sanitization, CSS selectors, and zero dependencies.

#dom-manipulation#sanitization#pure-python

Stars1.1k

Forks41

Last commit4 days ago

Selenium

A curated collection of Selenium resources including tools, drivers, containers, cloud services, and testing frameworks.

#selenium#awesome-list#browser-testing

Stars1.1k

Forks174

Last commit4 months ago

CrawlyElixir

A high-level web crawling and scraping framework for Elixir, designed for data extraction and processing.

#scraping-websites#elixir#web-crawling

Stars1.1k

Forks122

Last commit1 year ago

KimuraiRuby

Write web scrapers in Ruby using a clean, AI-assisted DSL that caches selectors for fast, LLM-free extraction.

#mechanize#antidetect-browser#camoufox

Stars1.1k

Forks162

Last commit5 months ago

MetaInspectorRuby

A Ruby gem for web scraping that extracts titles, meta tags, links, images, and structured data from URLs.

#link-extraction#nokogiri#metadata-extraction

Stars1.0k

Forks165

Last commit2 months ago

select.rsRust

A Rust library for extracting structured data from HTML documents, designed for web scraping tasks.

#css-selectors#dom-traversal#html-parsing

Stars1.0k

Forks68

Last commit1 year ago

storm-crawlerJava

A scalable, mature, and versatile web crawler built on Apache Storm for building low-latency, distributed crawling systems.

#distributed#real-time-processing#distributed-systems

A bullet-proof, fast, and reliable headless browser API for Chrome automation and testing.

#chrome#headless-chrome#graphql

Stars974

Forks34

Last commit8 years ago
