Web Crawling

22 projects

A Python library and CLI tool for web crawling, scraping, and extracting main text, metadata, and comments from web pages.

#text-extraction #readability #article-extractor

Stars: 6.3k

Forks: 397

Last commit: 6 days ago

DotnetSpider (C#)

A lightweight, efficient, and fast high-level web crawling and scraping framework for .NET.

#web-crawling #distributed #redis

Stars: 4.1k

Forks: 1.1k

Last commit: 3 months ago

SitemapGenerator (Ruby)

A framework-agnostic Ruby gem for generating XML sitemaps with Rails integration and support for multiple sitemap extensions.

#web-crawling #search-engine-optimization #ruby-gem

Stars: 2.5k

Forks: 277

Last commit: 21 hours ago

Ferrum (Ruby)

A high-level Ruby API for controlling Chrome/Chromium browsers directly via the Chrome DevTools Protocol.

#web-crawling #developer-tools #chrome

Stars: 2.0k

Forks: 168

Last commit: 18 days ago

domain_analyzer (Python)

A Python security analysis tool that automatically discovers and reports comprehensive information about a given domain.

#python-tool #dns-analysis #web-crawling

Stars: 1.9k

Forks: 236

Last commit: 3 years ago

Crawly (Elixir)

A high-level web crawling and scraping framework for Elixir, designed for data extraction and processing.

#scraping-websites #elixir #web-crawling

Stars: 1.1k

Forks: 122

Last commit: 1 year ago

Spidr (Ruby)

A versatile Ruby web spidering library for crawling sites, domains, or specific links with extensive filtering and callback support.

#web-crawling #spider #crawler

Stars: 835

Forks: 107

Last commit: 6 months ago

antch (Go)

A fast, powerful, and extensible web crawling and scraping framework for Go, inspired by Scrapy.

#web-crawling #concurrent #crawler

Stars: 266

Forks: 40

Last commit: 6 years ago

URL-to-PNG (TypeScript)

A self-hosted URL to PNG generator with parallel rendering via Playwright and configurable storage caching.

#playwright #web-crawling #amazon-s3

Stars: 251

Forks: 37

Last commit: 1 day ago

go-sitemap-generator (Go)

A Go library for generating various types of XML sitemaps with support for search engine pinging and cloud storage.

#web-crawling #library #xml-generation

Stars: 231

Forks: 65

Last commit: 2 years ago

warctools (Python)

Python command-line tools and libraries for handling, validating, and converting WARC and ARC web archive files.

#web-crawling #command-line-tools #python-library

Stars: 176

Forks: 33

Last commit: 11 months ago

aws-pdf-textract-pipeline (TypeScript)

Serverless data pipeline for crawling PDFs from the web and extracting structured data using AWS Textract.

#lambda #web-crawling #aws-textract

Stars: 165

Forks: 20

Last commit: 2 years ago

dyer (Rust)

A reliable, flexible, and fast Rust framework for web crawling and request-response services.

#event-driven #web-crawling #spider

Stars: 126

Forks: 7

Last commit: 11 months ago

sitemap (Elixir)

An Elixir library for generating sitemap.xml files with support for news, image, video, and mobile sitemaps.

#elixir #web-crawling #phoenix-framework

Stars: 105

Forks: 23

Last commit: 3 years ago

robotstxt (Rust)

A native Rust port of Google's robots.txt parser and matcher library, preserving all original behavior.

#web-crawling #library #web-standards

Stars: 102

Forks: 13

Last commit: 5 years ago

httrack2warc (Java)

Converts HTTrack website crawls into standardized WARC files for web archiving and preservation.

#web-crawling #digital-preservation #httrack-migration

Stars: 34

Forks: 6

Last commit: 1 year ago

textoken (Ruby)

A Ruby gem for customizable text tokenization, useful for web crawling and natural language processing.

#web-crawling #text-tokenization #regex

Stars: 31

Forks: 3

Last commit: 4 years ago

vl (Go)

A CLI tool that verifies the current status of URIs in text files like markdown documentation.

#web-crawling #link-checker #validation

Stars: 30

Forks: 3

Last commit: 2 years ago

SpiderMan (Elixir)

A fast high-level web crawling and scraping framework for Elixir, built on Broadway.

#elixir #web-crawling #spider

Stars: 28

Forks: 6

Last commit: 9 months ago

go-sitemap-parser (Go)

A Go library for parsing XML sitemaps, robots.txt, and gzipped sitemaps with configurable rules and concurrent fetching.

#sitemapxml #web-crawling #concurrent-processing

Stars: 7

Forks: 1

Last commit: 12 days ago

sitemap-format (Go)

A Go library for generating XML sitemaps with a simple, fluent API.

#web-crawling #static-site #xml-generation

Stars: 6

Forks: 0

Last commit: 3 years ago

X.Web.Sitemap (C#)

A .NET library for generating and managing XML sitemaps and sitemap indexes for websites.

#web-crawling #library #csharp

Stars: 5

Forks: 0

Last commit: 1 year ago
