Warc

31 projects

Heritrix (Java)

An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.

#webcrawling #digital-preservation #warc

Stars: 3.3k

Forks: 792

Last commit: 6 days ago

Heritrix Q&A (Java)

An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.

#webcrawling #digital-preservation #warc

Stars: 3.3k

Forks: 792

Last commit: 6 days ago

grab-site (Python)

A preconfigured web crawler for backing up websites, producing WARC files with a live dashboard and dynamic ignore patterns.

#archiving-tools #spider #archiving

Stars: 1.6k

Forks: 157

Last commit: 1 year ago

archiveweb.page (TypeScript)

A browser extension and desktop app for interactive, high-fidelity web archiving directly in the browser.

#replayweb-page #webrecorder #archiving

Stars: 1.5k

Forks: 107

Last commit: 11 days ago

Browsertrix Crawler (TypeScript)

A standalone Docker container for high-fidelity, browser-based web archiving crawls using Puppeteer and Brave.

#webrecorder #puppeteer #digital-preservation

Stars: 1.1k

Forks: 147

Last commit: 2 days ago

InterPlanetary Wayback (ipwb) (Python)

A distributed and persistent web archive replay system that uses IPFS to store and serve WARC files.

#ipfs #service-worker #wayback

Stars: 654

Forks: 40

Last commit: 1 month ago

warcio (Python)

A streaming WARC/ARC library for fast web archive I/O.

#web-archives #warc #python

Stars: 461

Forks: 70

Last commit: 1 month ago

warcdb (Python)

WarcDB is an SQLite-based file format that makes web crawl data easier to share and query.

#data-querying #database #warc

Stars: 406

Forks: 10

Last commit: 2 years ago

WAIL (Roff)

A graphical desktop application that simplifies web archiving by providing a one-click interface to preserve and replay web pages using Heritrix and OpenWayback.

#desktop-application #wayback #digital-preservation

Stars: 398

Forks: 38

Last commit: 1 month ago

Scoop (JavaScript)

A high-fidelity, browser-based web archiving library and CLI for capturing single web pages with provenance.

#single-page-capture #playwright #digital-preservation

Stars: 205

Forks: 12

Last commit: 10 months ago

Squidwarc (JavaScript)

A high-fidelity, user-scriptable archival web crawler using Chrome/Chromium to preserve JavaScript-rendered content.

#high-fidelity-preservation #chrome #puppeteer

Stars: 178

Forks: 25

Last commit: 6 years ago

ArchiveSpark (Scala)

An Apache Spark framework for efficient data processing, extraction, and derivation from web archives and archival collections.

#data-lineage #apache-spark #web-archives

Stars: 161

Forks: 19

Last commit: 9 months ago

SolrWayback (Java)

A web application for searching, browsing, and analyzing archived web content (ARC/WARC files) with a Solr backend.

#digital-preservation #open-source-archiving #tomcat

Stars: 145

Forks: 28

Last commit: 8 days ago

FastWARC (Rust)

A collection of robust and fast Python tools for parsing, extracting, and analyzing web archive data, including a high-performance WARC parser.

#cython #batch-processing #content-extraction

Stars: 144

Forks: 18

Last commit: 1 month ago

node-warc (JavaScript)

A Node.js library for parsing and creating Web ARChive (WARC) files with support for Chrome, Puppeteer, and Electron.

#stream-processing #web-archives #warc-files

Stars: 104

Forks: 22

Last commit: 1 year ago

Chronicler (JavaScript)

An offline-first web browser that archives, searches, and crawls websites for personal use.

#offline-browser #personal-archive #warc

Stars: 92

Forks: 8

Last commit: 7 years ago

crau (Python)

A command-line tool for archiving web pages into WARC files and replaying them locally.

#local-server #crawler #data-preservation

Stars: 64

Forks: 10

Last commit: 3 months ago

warc (Rust)

A Rust library for reading and writing WARC (Web ARChive) files.

#library #storage #warc

Stars: 60

Forks: 19

Last commit: 1 year ago

Warclight (Ruby)

A Rails engine for discovering web archives in WARC and ARC formats with faceted search and advanced discovery options.

#web-archives #rails #digital-libraries

Stars: 50

Forks: 9

Last commit: 3 years ago

A Whirlwind Tour of Common Crawl's Datasets using Python (Python)

A Python tutorial demonstrating how to access and process Common Crawl's web archive datasets (WARC, WET, WAT) using tools like warcio, cdx_toolkit, and DuckDB.

#data-indexing #parquet #cdx-index

Stars: 45

Forks: 9

Last commit