Web Archiving

"Awesome Web Archiving" is a curated collection of resources for preserving web content. Web archiving means capturing and storing web pages and their associated data so that digital history is not lost over time. The list covers tools, software, best practices, case studies, and community initiatives. It is aimed at researchers, historians, developers, and anyone interested in digital preservation who wants an overview of the methods and technologies used in the field.

web-archiving · digital-preservation · internet-history · archival-tools · web-crawlers · data-collection · research-resources

2.5k stars · 184 forks · 0 contributors

Training/Documentation

14 projects

What is a web archive?

youtu.be

Wikipedia's List of Web Archiving Initiatives

en.wikipedia.org

Glossary of Archive-It and Web Archiving Terms

support.archive-it.org

The Web Archiving Lifecycle Model

archive-it.org

Retrieving and Archiving Information from Websites by Wael Eskandar and Brad Murray

kit.exposingtheinvisible.org

IIPC and DPC Training materials: module for beginners (8 sessions)

netpreserve.org

UNT Web Archiving Course

A comprehensive open educational resource for teaching web archiving concepts, practices, and tools.

23 stars · 2 years ago

Continuing Education to Advance Web Archiving (CEDWARC)

cedwarc.github.io

A Whirlwind Tour of Common Crawl's Datasets using Python

A Python tutorial demonstrating how to access and process Common Crawl's web archive datasets (WARC, WET, WAT) using tools like warcio, cdx_toolkit, and DuckDB.

Python · 45 stars · 1 month ago

warc-specifications

iipc.github.io

Official ISO 28500 WARC specification homepage

bibnum.bnf.fr

GLAM Workbench: Web Archives

glam-workbench.github.io

Archives Unleashed Toolkit documentation

aut.docs.archivesunleashed.org

Tutorial for Humanities researchers about how to explore Arquivo.pt

sobre.arquivo.pt

Resources for Web Publishers

2 projects

Definition of Web Archivability

nullhandle.org

Archive Ready

archiveready.com

Tools & Software

2 projects

Comparison of web archiving software

A compilation of research materials on data resilience, interactivity, and related topics for the Data Together community.

100 stars · 7 years ago

Awesome Website Change Monitoring

A comprehensive curated list of open-source and hosted tools for monitoring and detecting changes on websites.

514 stars · 9 months ago

Acquisition

41 projects

ArchiveBox

Open-source self-hosted web archiving tool that saves websites in multiple durable formats like HTML, PDF, and WARC.

Python · 27,985 stars · 11 hours ago

A Python tool to automatically archive web content (videos, images, social media) from Google Sheets and other sources in a secure, verifiable way.

Python · 1,096 stars · 9 days ago

Browsertrix Crawler

A standalone Docker container for high-fidelity, browser-based web archiving crawls using Puppeteer and Brave.

TypeScript · 1,087 stars · 1 day ago

Brozzler

Python · 809 stars · 12 days ago

Cairn

NPM package and CLI tool for saving web pages as single HTML files, implemented in TypeScript.

TypeScript522 days ago

Chronicler

An offline-first web browser that archives, searches, and crawls websites for personal use.

JavaScript · 92 stars · 7 years ago

Community Archive

community-archive.org

crau

A command-line tool for archiving web pages into WARC files and replaying them locally.

Offline full-text search and archiving tool for Chromium-based browsers that saves and indexes every page you visit.

JavaScript · 3,902 stars · 3 months ago

F(b)arc

A command-line tool and Python library for archiving Facebook data via the Graph API, supporting recursive retrieval of nodes and edges.

Python · 78 stars · 8 years ago

freeze-dry

TypeScript · 303 stars · 3 years ago

grab-site

A preconfigured web crawler for backing up websites, producing WARC files with a live dashboard and dynamic ignore patterns.

Python · 1,601 stars · 1 year ago

Heritrix

An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.

Java · 3,279 stars · 5 days ago

Heritrix Q&A

An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.

Java · 3,279 stars · 5 days ago

Heritrix Walkthrough

A virtual machine and walkthrough for setting up and using the Heritrix web crawler for web archiving.

Shell · 10 stars · 10 years ago

html2warc

A Python script that converts offline web resources into a single WARC file for archiving.

CLI tool and library for saving complete web pages as a single, self-contained HTML file.

Rust · 15,351 stars · 1 month ago

Obelisk

A Go package and CLI tool that saves web pages as single HTML files with all assets embedded.

Go · 318 stars · 5 months ago

Scoop

A high-fidelity, browser-based web archiving library and CLI for capturing single web pages with provenance.

JavaScript · 205 stars · 10 months ago

SingleFile

JavaScript · 21,888 stars · 4 months ago

SiteStory

mementoweb.github.io

Social Feed Manager

gwu-libraries.github.io

Squidwarc

A high-fidelity, user-scriptable archival web crawler using Chrome/Chromium to preserve JavaScript-rendered content.

JavaScript · 178 stars · 6 years ago

StormCrawler

stormcrawler.net

twarc

A command line tool and Python library for collecting and archiving Twitter JSON data via the Twitter API.

Python · 1,393 stars · 8 months ago

WAIL

A graphical desktop application that simplifies web archiving by providing a one-click interface to preserve and replay web pages using Heritrix and OpenWayback.

A dockerized, queued web archiver using Chrome headless to create high-fidelity WARC files from URLs.

Python · 62 stars · 2 years ago

Wayback

A privacy-focused web archiving tool with an IM-style interface that captures pages to multiple archival services.

Go · 2,219 stars · 11 hours ago

Waybackpy

A Python package and CLI tool for interacting with the Wayback Machine's Save, CDX, and Availability APIs.

Python · 598 stars · 2 years ago

Web2Warc

A customizable Scala crawler for creating personal web archives in WARC/CDX format.

Replay

6 projects

InterPlanetary Wayback (ipwb)

A distributed and persistent web archive replay system that uses IPFS to store and serve WARC files.

Python · 654 stars · 1 month ago

OpenWayback

Legacy web archive replay engine for accessing historical web content from WARC files.

Java · 521 stars · 2 years ago

PYWB

JavaScript · 1,680 stars · 3 months ago

Converts WARC web archive files to static HTML with relative links for offline browsing or rehosting.

Java · 59 stars · 10 months ago

Search & Discovery

13 projects

hyphe

A research-driven web crawler for building and analyzing curated web corpora as networks of web entities.

JavaScript · 384 stars · 2 months ago

Mink

A Chrome extension that integrates live web browsing with archived copies using the Memento protocol.

JavaScript · 59 stars · 10 months ago

PANDORÆ

A desktop application for retrieving, normalizing, and exploring document collections from scientific, institutional, and web sources without coding.

JavaScript · 16 stars · 1 month ago

playback

A command-line tool to playback archived webpages from the Wayback Machine using GitHub as a source.

A toolkit for indexing and exploring web archive content from ARC and WARC files using OpenSearch/Elasticsearch.

Java · 133 stars · 8 months ago

Shine

A prototype web archive exploration UI powered by Solr for searching and browsing archived web content.

JavaScript · 43 stars · 6 years ago

SolrWayback

A web application for searching, browsing, and analyzing archived web content (ARC/WARC files) with a Solr backend.

Java · 145 stars · 7 days ago

Warclight

A Rails engine for discovering web archives in WARC and ARC formats with faceted search and advanced discovery options.

Ruby · 50 stars · 3 years ago

Wasp

A personal web archive and search system that runs in a Docker container, allowing you to browse and search archived web pages.

Java · 28 stars · 20 days ago

Utilities

28 projects

ArchiveTools

A Python toolkit for extracting, filtering, and analyzing data from web archives, JSON files, and imageboards.

Python · 79 stars · 4 years ago

bagnabit2warc

Converts bag-nabit datasets from ZIP archives into full-content WARC files for web archiving.

Python · 1 star · 4 months ago

cdx-toolkit

pypi.org

duckdb-web-archive-cdx

A DuckDB extension to query web archive CDX APIs (Wayback Machine & Common Crawl) directly from SQL with smart query pushdown.

C++2122 days ago

Go Get Crawl

A Go tool and library for downloading URLs and files from Common Crawl and Wayback Machine web archives.

Go · 183 stars · 1 year ago

gowarcserver

A BadgerDB-based server for indexing and serving WARC file contents with CDX capture index support.

A CLI tool to test URL availability and retrieve Internet Archive snapshots, with output in JSON, CSV, or BoltDB.

Go1015 days ago

Internet Archive Library

Python · 1,883 stars · 13 days ago

httrack2warc

Converts HTTrack website crawls into standardized WARC files for web archiving and preservation.

Java · 34 stars · 1 year ago

MementoMap

A framework for profiling web archives to summarize their holdings using compact SURT-based maps.

Python · 12 stars · 5 years ago

MemGator

A portable concurrent Memento aggregator CLI and server for retrieving archived web pages from multiple sources.

Go · 80 stars · 3 months ago

node-cdxj

A Node.js library for parsing CDXJ files produced by web archiving tools like Pywb.

JavaScript · 2 stars · 9 years ago

OutbackCDX

A high-performance, RocksDB-based capture index server for web archives, supporting OpenWayback and PyWb protocols.

Java · 43 stars · 11 days ago

py-wasapi-client

A Python command-line client for downloading web archive (WARC) files from Archive-It and Webrecorder WASAPI Data Transfer APIs.

Extracts hyperlinks from files using Apache Tika for batch processing and web archiving workflows.

HTML · 11 stars · 1 year ago

wasapi-downloader

A Java command-line application for downloading web archive (WARC) files from the WASAPI (Web Archiving Systems API).

WarcDB

WarcDB is an SQLite-based file format that makes web crawl data easier to share and query.

Python · 406 stars · 2 years ago

warcbench

A resilient and configurable Python tool for exploring, analyzing, transforming, and extracting data from WARC (Web ARChive) files.

Python · 22 stars · 11 months ago

warcdedupe

gitlab.com

warc-safe

A Python tool that scans WARC web archive files for viruses and NSFW content using AI and antivirus detection.

Python · 18 stars · 5 days ago

WarcPartitioner

A Hadoop/MapReduce tool that splits and partitions web archive records in (W)ARC files by MIME type and year.

Java · 1 star · 9 years ago

warcrefs

Java-based web archive deduplication tool that identifies duplicates and converts them to reference records in WARC files.

Java · 10 stars · 7 years ago

webarchive-indexing

MapReduce tools for bulk indexing of web archive WARC/ARC files into ZipNum sharded CDX clusters on Hadoop, EMR, or local systems.

Python · 46 stars · 8 years ago

wikiteam

A set of Python tools for downloading and preserving wikis, including MediaWiki wikis and Wikimedia projects.

Python · 856 stars · 6 months ago

WARC I/O Libraries

13 projects

FastWARC

A collection of robust and fast Python tools for parsing, extracting, and analyzing web archive data, including a high-performance WARC parser.

Rust · 144 stars · 1 month ago

HadoopConcatGz

A splitable Hadoop InputFormat for processing concatenated GZIP files and web archive (*.warc.gz) data efficiently in distributed systems.

Java · 9 stars · 8 years ago

jwarc

A Java library for reading and writing WARC files with a typed, extensible API and high-performance NIO-based parsing.

Java6023 days ago

Jwat-Tools

A command-line tool for performing various gzip, ARC, WARC, and XML tasks on web archive files.

Java · 5 stars · 2 years ago

node-warc

A Node.js library for parsing and creating Web ARChive (WARC) files with support for Chrome, Puppeteer, and Electron.

JavaScript · 104 stars · 1 year ago

Sparkling

A Scala/Spark library for efficient processing, extraction, and derivation of web archive data (CDX/WARC).

Scala · 17 stars · 2 months ago

Unwarcit

A Python CLI tool for extracting and validating WARC and WACZ web archive files.

Python · 13 stars · 4 years ago

warc

A Rust library for reading and writing WARC (Web ARChive) files.

A command-line tool and Rust library for handling Web ARChive (WARC) files.

Python command-line tools and libraries for handling, validating, and converting WARC and ARC web archive files.

Python · 176 stars · 11 months ago

webarchive

A Go library for reading and parsing WARC and ARC web archive formats with specialized utilities for web archiving workflows.

Go · 20 stars · 3 years ago

Analysis

9 projects

Archives Research Compute Hub

A job server for distributed compute analysis of web archive (WARC) collections.

Scala · 20 stars · 3 months ago

ArchiveSpark

An Apache Spark framework for efficient data processing, extraction, and derivation from web archives and archival collections.

Scala · 161 stars · 9 months ago

Archives Unleashed Notebooks

Example notebooks for analyzing web archives using the Archives Unleashed Toolkit.

Jupyter Notebook · 26 stars · 3 years ago

Archives Unleashed Toolkit

An open-source toolkit for analyzing web archives at scale using Apache Spark.

Scala · 158 stars · 7 months ago

Common Crawl Columnar Index

commoncrawl.org

Common Crawl Web Graph

commoncrawl.org

Common Crawl Jupyter notebooks

A collection of Jupyter notebooks for analyzing Common Crawl web archive data using columnar indexes and webgraph datasets.

Jupyter Notebook6617 days ago

Tweet Archives Unleashed Toolkit

An open-source toolkit for analyzing line-oriented JSON Twitter archives using Apache Spark.

Scala · 10 stars · 4 months ago

Web Data Commons

webdatacommons.org

Quality Assurance

12 projects

Chrome Check My Links

chromewebstore.google.com

Chrome link checker

chromewebstore.google.com

Chrome link gopher

chromewebstore.google.com

Chrome Open Multiple URLs

chromewebstore.google.com

Chrome Revolver

chromewebstore.google.com

FlameShot

A powerful, open-source screenshot tool with built-in annotation and editing capabilities for Linux, Windows, and macOS.

Windows Snipping Tool

support.microsoft.com

WineBottler

winebottler.kronenberg.org

xDoTool

A command-line tool for simulating keyboard/mouse input and automating window management on X11 systems.

C · 3,830 stars · 20 days ago

Xenu

home.snafu.de

Curation

1 project

Zotero Robust Links Extension

robustlinks.mementoweb.org

Other Awesome Lists

3 projects

Awesome Memento

A curated list of software, literature, and resources for the Memento protocol (RFC7089) enabling time-based access to archived web content.

121 stars · 1 month ago

The WARC Ecosystem

archiveteam.org

The Web Crawl section of COPTR

coptr.digipres.org

Blogs and Scholarship

7 projects

IIPC Blog

netpreserveblog.wordpress.com

Web Archiving Roundtable

webarchivingrt.wordpress.com

Common Crawl Foundation Blog

commoncrawl.org
