Web Crawler

31 projects


An open-source web crawler and scraper that converts web content into clean, LLM-ready Markdown for RAG, agents, and data pipelines.

#playwright #ai-agents #markdown-generation

An incredibly fast web crawler designed for OSINT (Open Source Intelligence) data extraction.

#information-gathering #spider #osint

Stars: 13.1k

Forks: 1.7k

Last commit: 5 months ago

webmagic · Java

A scalable Java framework for building web crawlers, covering downloading, URL management, content extraction, and persistence.

#distributed-systems #crawler #html-parsing

Stars: 11.7k

Forks: 4.1k

Last commit: 7 months ago

Node-Crawler · TypeScript

A Node.js web crawler with server-side jQuery, rate limiting, and proxy support for efficient scraping.

#proxy-support #jquery #spider

Stars: 6.8k

Forks: 866

Last commit: 1 month ago

Crawler4j · Java

An open-source Java web crawler that provides a simple interface for multi-threaded web crawling.

#java-library #open-source #crawling-framework

Stars: 4.6k

Forks: 1.9k

Last commit: 4 years ago

TorBot · Python

An open-source intelligence (OSINT) tool for crawling and analyzing websites on the dark web and beyond.

#python-web-crawler #spider #osint

Stars: 4.4k

Forks: 705

Last commit: 2 days ago

Heritrix Q&A · Java

An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.

#webcrawling #digital-preservation #warc

Stars: 3.3k

Forks: 792

Last commit: 7 days ago

Heritrix · Java

An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.

#webcrawling #digital-preservation #warc

Stars: 3.3k

Forks: 792

Last commit: 7 days ago

webclaw · Rust

A fast, local-first web scraper and content extractor optimized for AI agents, with CLI, REST API, and MCP server.

#content-extraction #crawler #cli-tool

Stars: 1.8k

Forks: 197

Last commit: 22 hours ago

uCSS · JavaScript

A Node.js tool for crawling websites to find unused and duplicate CSS selectors.

#performance-optimization #frontend-tooling #nodejs

Stars: 1.6k

Forks: 62

Last commit: 9 years ago

grab-site · Python

A preconfigured web crawler for backing up websites, producing WARC files with a live dashboard and dynamic ignore patterns.

#archiving-tools #spider #archiving

Stars: 1.6k

Forks: 157

Last commit: 1 year ago

Wombat · Ruby

A lightweight Ruby web crawler and scraper with an elegant DSL for extracting structured data from web pages.

#dsl #crawler #ruby-gem

Stars: 1.4k

Forks: 128

Last commit: 3 months ago

PHP Spider · PHP

A configurable and extensible PHP web spider for crawling and scraping websites, with support for breadth-first and depth-first traversal, caching, and custom filters.

#event-driven #caching #css-selectors

Stars: 1.3k

Forks: 231

Last commit: 22 days ago

XSRFProbe · Python

An advanced Cross-Site Request Forgery (CSRF) audit and exploitation toolkit for security testing.

#python-tool #csrf-attacks #owasp

Stars: 1.3k

Forks: 219

Last commit: 10 days ago

Browsertrix Crawler · TypeScript

A standalone Docker container for high-fidelity, browser-based web archiving crawls using Puppeteer and Brave.

#webrecorder #puppeteer #digital-preservation

Stars: 1.1k

Forks: 147

Last commit: 6 hours ago

storm-crawler · Java

A scalable, mature, and versatile web crawler built on Apache Storm for low-latency, distributed crawling systems.

#distributed #real-time-processing #distributed-systems

Stars: 986

Forks: 285

Last commit: 15 hours ago

Crawler · Elixir

A high-performance web crawler and scraper built in Elixir with worker pooling and rate limiting.

#elixir #spider #offline

Stars: 958

Forks: 89

Last commit: 28 days ago

Spidr · Ruby

A versatile Ruby web spidering library for crawling sites, domains, or specific links with extensive filtering and callback support.

#web-crawling #spider #crawler

Stars: 835

Forks: 107

Last commit: 6 months ago

Google Play crawler (Java) · Java

A Java API for searching and downloading Android applications from Google Play, with device emulation capabilities.

#java-library #apk-downloader #android

Stars: 596

Forks: 213

Last commit: 2 years ago

hyphe · JavaScript

A research-driven web crawler for building and analyzing curated web corpora as networks of web entities.

#research-tool #web-entities #corpus-curation

Stars: 384

Forks: 62

Last commit: 2 months ago

crawley · Go

A fast, Unix-style command-line web crawler that extracts links, resources, and API endpoints from web pages.

#api-discovery #resource-discovery #link-extraction

Stars: 340

Forks: 18

Last commit: 4 days ago

antch · Go

A fast, powerful, and extensible web crawling and scraping framework for Go, inspired by Scrapy.

#web-crawling #concurrent #crawler

Stars: 266

Forks: 40

Last commit: 6 years ago

Squidwarc · JavaScript

A high-fidelity, user-scriptable archival web crawler using Chrome/Chromium to preserve JavaScript-rendered content.

#high-fidelity-preservation #chrome #puppeteer

Stars: 178

Forks: 25

Last commit: 6 years ago

dyer · Rust

A reliable, flexible, and fast Rust framework for web crawling and request-response services.

#event-driven #web-crawling #spider

Stars: 126

Forks: 7

Last commit: 11 months ago

Chronicler · JavaScript

An offline-first web browser that archives, searches, and crawls websites for personal use.

#offline-browser #personal-archive #warc

Stars: 92

Forks: 8

Last commit: 7 years ago

oeis · Python

Python tools to download, process, and analyze data from the On-Line Encyclopedia of Integer Sequences (OEIS).

#mathematics #oeis #python

Stars: 51

Forks: 38

Last commit: 1 year ago

boomerang · Python

A client-minion tool for consistent and safe capture of off-network web resources during security investigations.

#client-server #rest-api #operational-security

Stars: 39

Forks: 6

Last commit: 9 years ago

test-crawler · TypeScript

A visual regression testing tool that crawls websites and provides snapshot comparison reports.

#visual-regression-testing #puppeteer #ui-testing

Stars: 33

Forks: 5

Last commit: 4 years ago

Web2Warc · Scala

A customizable Scala crawler for creating personal web archives in WARC/CDX format.

#digital-preservation #cdx #warc

Stars: 26

Forks: 3

Last commit: 8 years ago

Heritrix Walkthrough · Shell

A virtual machine and walkthrough for setting up and using the Heritrix web crawler for web archiving.

#digital-preservation #research-tools #virtual-machine

Stars: 10

Forks: 1

Last commit: 10 years ago

Linkbuilding Spider · PHP

A PHP tool that checks target websites for links to your site or competitors' sites for SEO analysis.

#link-building #backlink-checker #marketing-automation

Stars: 8

Forks: 2

Last commit: 3 years ago
