Crawler

50 projects

Showing 36 of 50 projects

scrapy (Python)

Scrapy, a fast high-level web crawling & scraping framework for Python.

#hacktoberfest #crawler #web-scraping-python

Stars: 63.3k

Forks: 11.8k

Last commit: 23 hours ago

colly (Go)

A fast and elegant scraping and crawling framework for Go, designed for extracting structured data from websites.

#spider #crawler #scraper

Stars: 25.4k

Forks: 1.9k

Last commit: 1 month ago

Photon (Python)

An incredibly fast web crawler designed for OSINT (Open Source Intelligence) data extraction.

#information-gathering #spider #osint

Stars: 13.1k

Forks: 1.7k

Last commit: 5 months ago

webmagic (Java)

A scalable Java framework for building web crawlers, covering downloading, URL management, content extraction, and persistence.

#distributed-systems #crawler #html-parsing

Stars: 11.7k

Forks: 4.1k

Last commit: 7 months ago

Node-Crawler (TypeScript)

A Node.js web crawler with server-side jQuery, rate limiting, and proxy support for efficient scraping.

#proxy-support #jquery #spider

Stars: 6.8k

Forks: 866

Last commit: 1 month ago

trafilatura (Python)

A Python library and CLI tool for web crawling, scraping, and extracting main text, metadata, and comments from web pages.

#text-extraction #readability #article-extractor

Stars: 6.3k

Forks: 397

Last commit: 4 days ago

myGPTReader (Python)

A Slack bot that reads and summarizes webpages, documents, and videos using ChatGPT, with voice chat capabilities.

#ai #content-summarization #embedding

Stars: 4.4k

Forks: 441

Last commit: 5 months ago

TorBot (Python)

An open-source intelligence (OSINT) tool for crawling and analyzing websites on the dark web and beyond.

#python-web-crawler #spider #osint

Stars: 4.4k

Forks: 705

Last commit: 2 days ago

DotnetSpider (C#)

A lightweight, efficient, and fast high-level web crawling and scraping framework for .NET.

#web-crawling #distributed #redis

Stars: 4.1k

Forks: 1.1k

Last commit: 3 months ago

Puppeteer Sharp (C#)

A .NET port of the official Node.js Puppeteer API for headless browser automation.

#aot-compilation #chrome #puppeteer

Stars: 3.9k

Forks: 486

Last commit: 4 days ago

magnetissimo (Elixir)

A self-hosted web application that indexes torrent sites and saves magnet links to a local database.

#elixir #no-javascript #phoenix-framework

Stars: 3.1k

Forks: 185

Last commit: 2 years ago

CrawlerDetect (PHP)

A PHP class for detecting bots, crawlers, and spiders via user agent and HTTP headers.

#bots #hacktoberfest #user-agent

Stars: 2.4k

Forks: 279

Last commit: 11 days ago

dirhunt (Python)

A web crawler that finds web directories without brute forcing.

#dirscanner #without-bruteforce #crawler

Stars: 2.0k

Forks: 274

Last commit: 2 years ago

webclaw (Rust)

A fast, local-first web scraper and content extractor optimized for AI agents, with CLI, REST API, and MCP server.

#content-extraction #crawler #cli-tool

An async Python web scraping micro-framework built on asyncio and aiohttp for fast, extensible crawling.

#python-3 #asyncio #aiohttp

Stars: 1.7k

Forks: 186

Last commit: 3 years ago

grab-site (Python)

A preconfigured web crawler for backing up websites, producing WARC files with a live dashboard and dynamic ignore patterns.

#archiving-tools #spider #archiving

Stars: 1.6k

Forks: 157

Last commit: 1 year ago

Upton (HTML)

A batteries-included Ruby framework for easy web-scraping with built-in debug mode and rate limiting.

#debugging-tools #nokogiri #crawler

Stars: 1.6k

Forks: 108

Last commit: 7 years ago

SwiftLinkPreview (Swift)

A Swift library for generating link previews (title, description, images) from URLs on Apple platforms.

#ios #metadata-extraction #crawler

Stars: 1.4k

Forks: 196

Last commit: 1 year ago

Wombat (Ruby)

A lightweight Ruby web crawler and scraper with an elegant DSL for extracting structured data from web pages.

#dsl #crawler #ruby-gem

Stars: 1.4k

Forks: 128

Last commit: 3 months ago

XSRFProbe (Python)

An advanced Cross-Site Request Forgery (CSRF) audit and exploitation toolkit for security testing.

#python-tool #csrf-attacks #owasp

Stars: 1.3k

Forks: 219

Last commit: 10 days ago

Fess (Java)

An open-source, self-hosted enterprise and site search server built on OpenSearch. It crawls web, file, database, and cloud sources, supports 20+ languages, and offers a REST API plus AI/RAG and semantic search. Licensed under Apache-2.0.

#search #ai-search #crawler

A high-level web crawling and scraping framework for Elixir, designed for data extraction and processing.

#scraping-websites #elixir #web-crawling

Stars: 1.1k

Forks: 122

Last commit: 1 year ago

Kimurai (Ruby)

Write web scrapers in Ruby using a clean, AI-assisted DSL that caches selectors for fast, LLM-free extraction.

#mechanize #antidetect-browser #camoufox

Stars: 1.1k

Forks: 162

Last commit: 5 months ago

Browsertrix Crawler (TypeScript)

A standalone Docker container for high-fidelity, browser-based web archiving crawls using Puppeteer and Brave.

#webrecorder #puppeteer #digital-preservation

Stars: 1.1k

Forks: 147

Last commit: 12 hours ago

storm-crawler (Java)

A scalable, mature, and versatile web crawler built on Apache Storm for building low-latency, distributed crawling systems.

#distributed #real-time-processing #distributed-systems

Stars: 986

Forks: 285

Last commit: 21 hours ago

Crawler (Elixir)

A high-performance web crawler and scraper built in Elixir with worker pooling and rate limiting.

#elixir #spider #offline

Stars: 958

Forks: 89

Last commit: 29 days ago

Spidr (Ruby)

A versatile Ruby web spidering library for crawling sites, domains, or specific links with extensive filtering and callback support.

#web-crawling #spider #crawler

Stars: 835

Forks: 107

Last commit: 6 months ago

siteone-crawler (Rust)

A cross-platform website crawler and analyzer for SEO, security, accessibility, and performance optimization, built in Rust.

#seo-analysis #website-crawler #devops

Stars: 808

Forks: 73

Last commit: 23 days ago

jvppeteer (Java)

A Java API for controlling Chrome and Firefox browsers via the DevTools and WebDriver BiDi protocols.

#chrome #puppeteer #screenshot

Stars: 808

Forks: 170

Last commit: 10 days ago

dataflowkit (Go)

A Go web scraping framework that extracts structured data from websites using CSS selectors, including JavaScript-rendered pages.

#chrome-fetcher #scraping-websites #javascript-rendering

Stars: 715

Forks: 83

Last commit: 3 years ago

read-art (JavaScript)

A Node.js library to automatically scrape and extract readable article content from any web page, supporting both English and Chinese.

#readability #content-extraction #crawler

Stars: 346

Forks: 36

Last commit: 8 years ago

crawley (Go)

A fast, Unix-style command-line web crawler that extracts links, resources, and API endpoints from web pages.

#api-discovery #resource-discovery #link-extraction

Stars: 340

Forks: 18

Last commit: 4 days ago

antch (Go)

A fast, powerful, and extensible web crawling and scraping framework for Go, inspired by Scrapy.

#web-crawling #concurrent #crawler

Stars: 266

Forks: 40

Last commit: 6 years ago

Packagist Mirror (PHP)

A tool to create a local or public mirror of Packagist metadata for faster Composer package downloads in regions with slow internet.

#composer-packages #composer #devops

Stars: 200

Forks: 68

Last commit: 1 year ago

Go Get Crawl (Go)

A Go tool and library for downloading URLs and files from Common Crawl and Wayback Machine web archives.

#crawler #go-library #wayback-machine

Stars: 183

Forks: 17

Last commit: 1 year ago
