Open-Awesome

webarchive-indexing

MIT · Python

MapReduce tools for bulk indexing of web archive WARC/ARC files into ZipNum sharded CDX clusters on Hadoop, EMR, or local systems.

46 stars · 12 forks · 0 contributors

Overview

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR, or the local file system.
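
To make the description concrete, here is a minimal Python sketch of the general ZipNum idea: sorted CDX lines are split into fixed-size blocks, each block is gzip-compressed on its own, and a small secondary index records the first key, byte offset, and length of every block. It does not use or reproduce this project's code. The function names `build_zipnum` and `read_block`, the sample CDX lines, the tiny block size, and the simplified one-field key are all assumptions made for illustration.

```python
import gzip
import io


def build_zipnum(cdx_lines, lines_per_block=3000):
    """Pack sorted CDX lines into independently gzipped blocks.

    Returns (data, index):
      data  -- bytes, one gzip member per block, concatenated
      index -- list of (first_key, offset, length), one entry per block
    Hypothetical helper; the key is simplified to the first field only.
    """
    lines = sorted(cdx_lines)
    data = io.BytesIO()
    index = []
    for start in range(0, len(lines), lines_per_block):
        block = lines[start:start + lines_per_block]
        text = "".join(line + "\n" for line in block)
        payload = gzip.compress(text.encode("utf-8"))
        first_key = block[0].split(" ", 1)[0]
        # Record where this block starts and how many bytes it occupies.
        index.append((first_key, data.tell(), len(payload)))
        data.write(payload)
    return data.getvalue(), index


def read_block(data, offset, length):
    """Decompress a single block using its index entry (hypothetical helper)."""
    return gzip.decompress(data[offset:offset + length]).decode("utf-8").splitlines()


# Simplified, made-up CDX lines (real records carry more fields).
sample = [
    "org,example)/ 20150102000000 http://example.org/ text/html 200",
    "com,example)/about 20150101000000 http://example.com/about text/html 200",
    "com,example)/ 20150101000000 http://example.com/ text/html 200",
    "net,example)/ 20150103000000 http://example.net/ text/html 200",
    "com,example)/ 20150201000000 http://example.com/ text/html 200",
]

data, index = build_zipnum(sample, lines_per_block=2)
for first_key, offset, length in index:
    print(first_key, offset, length)
```

Because each block is a separate gzip member, a lookup only has to search the small index for the right block and decompress that one slice, rather than scanning the whole file.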

Quick Stats

Stars: 46
Forks: 12
Contributors: 0
Open issues: 4
Last commit: 8 years ago
Created: 2015

Tags

#mapreduce #warc-indexing #python #web-archiving #hadoop

Built With

Hadoop
Python

Included in

Web Archiving · 2.5k
Auto-fetched 1 day ago

Related Projects

wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to tiniest wikis. As of 2026, WikiTeam has preserved more than 600,000 wikis.

Stars: 848
Forks: 175
Last commit: 5 months ago

warcdb

WarcDB: Web crawl data as SQLite databases.

Stars: 405
Forks: 10
Last commit: 1 year ago

Go Get Crawl

Extract web archive data using Wayback Machine and Common Crawl

Stars: 180
Forks: 17
Last commit: 1 year ago

MemGator

A Memento Aggregator CLI and Server in Go

Stars: 80
Forks: 13
Last commit: 2 months ago