Web Archiving

80 projects

ArchiveBox · Python

Open-source self-hosted web archiving tool that saves websites in multiple durable formats like HTML, PDF, and WARC.

#pinboard #data-backup #wget

CLI tool and library for saving complete web pages as a single, self-contained HTML file.

#make-the-internet-great-again #e-hoarding #cli-tool

Stars: 15.4k

Forks: 468

Last commit: 2 months ago

Shiori · Go

A simple, self-hosted bookmark manager with offline archiving, built as a Pocket alternative.

#hacktoberfest #offline-reading #portable

Stars: 11.5k

Forks: 624

Last commit: 14 days ago

DiskerNet · JavaScript

Offline full-text search and archiving tool for Chromium-based browsers that saves and indexes every page you visit.

#archiver #web-browsing #privacy-tools

Stars: 3.9k

Forks: 147

Last commit: 3 months ago

Heritrix · Java

An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.

#webcrawling #digital-preservation #warc

Stars: 3.3k

Forks: 792

Last commit: 9 days ago

Awesome Web Archiving

A curated list of resources, tools, and services for web archiving, from acquisition and replay to analysis and community.

#data-curation #digital-libraries #digital-preservation

Stars: 2.6k

Forks: 196

Last commit: 2 months ago

Wayback · Go

A privacy-focused web archiving tool with an IM-style interface that captures pages to multiple archival services.

#ipfs #privacy-tools #digital-preservation

Stars: 2.2k

Forks: 86

Last commit: 2 days ago

PYWB · JavaScript

Core Python web archiving toolkit for replay and recording of web archives.

#web-archives #wayback #python

Stars: 1.7k

Forks: 238

Last commit: 3 months ago

grab-site · Python

A preconfigured web crawler for backing up websites, producing WARC files with a live dashboard and dynamic ignore patterns.

#archiving-tools #spider #archiving

Stars: 1.6k

Forks: 157

Last commit: 1 year ago

archiveweb.page · TypeScript

A browser extension and desktop app for interactive, high-fidelity web archiving directly in the browser.

#replayweb-page #webrecorder #archiving

Stars: 1.5k

Forks: 107

Last commit: 15 days ago

Auto Archiver · Python

A Python tool to automatically archive web content (videos, images, social media) from Google Sheets and other sources in a secure, verifiable way.

#service #image-archiving #python

Stars: 1.1k

Forks: 107

Last commit: 2 days ago

Browsertrix Crawler · TypeScript

A standalone Docker container for high-fidelity, browser-based web archiving crawls using Puppeteer and Brave.

#webrecorder #puppeteer #digital-preservation

A set of Python tools for downloading and preserving wikis, including MediaWiki wikis and Wikimedia projects.

#backup-tools #wikiteam #cultural-heritage

Stars: 857

Forks: 174

Last commit: 6 months ago

InterPlanetary Wayback (ipwb) · Python

A distributed and persistent web archive replay system that uses IPFS to store and serve WARC files.

#ipfs #service-worker #wayback

Stars: 654

Forks: 40

Last commit: 1 month ago

Waybackpy · Python

A Python package and CLI tool for interacting with the Wayback Machine's Save, CDX, and Availability APIs.

#archive-webpages #python-library #archive-webpage

Stars: 599

Forks: 40

Last commit: 2 years ago

OpenWayback · Java

Legacy web archive replay engine for accessing historical web content from WARC files.

#memento-protocol #digital-preservation #internet-archive

Stars: 522

Forks: 313

Last commit: 2 years ago

Awesome Website Change Monitoring

A comprehensive curated list of open-source and hosted tools for monitoring and detecting changes on websites.

#change-detection #diffing #awesome-list

Stars: 514

Forks: 41

Last commit: 9 months ago

warcio · Python

Streaming WARC/ARC library for fast web archive I/O.

#web-archives #warc #python

Stars: 462

Forks: 70

Last commit: 1 month ago

archivenow · Python

A tool to push web resources into web archives.

#internet-archive #web-archiving

Stars: 434

Forks: 40

Last commit: 2 years ago

warcdb · Python

WarcDB is an SQLite-based file format that makes web crawl data easier to share and query.

#data-querying #database #warc

Stars: 406

Forks: 10

Last commit: 2 years ago

WAIL · Roff

A graphical desktop application that simplifies web archiving by providing a one-click interface to preserve and replay web pages using Heritrix and OpenWayback.

#desktop-application #wayback #digital-preservation

Stars: 398

Forks: 38

Last commit: 1 month ago

Obelisk · Go

A Go package and CLI tool that saves web pages as single HTML files with all assets embedded.

#hacktoberfest #single-file-export #html-embedding

Stars: 318

Forks: 25

Last commit: 5 months ago

Scoop · JavaScript

A high-fidelity, browser-based web archiving library and CLI for capturing single web pages with provenance.

#single-page-capture #playwright #digital-preservation

Stars: 205

Forks: 12

Last commit: 10 months ago

Go Get Crawl · Go

A Go tool and library for downloading URLs and files from Common Crawl and Wayback Machine web archives.

#crawler #go-library #wayback-machine

Stars: 183

Forks: 17

Last commit: 1 year ago

Squidwarc · JavaScript

A high-fidelity, user-scriptable archival web crawler using Chrome/Chromium to preserve JavaScript-rendered content.

#high-fidelity-preservation #chrome #puppeteer

Stars: 178

Forks: 25

Last commit: 6 years ago

warctools · Python

Python command-line tools and libraries for handling, validating, and converting WARC and ARC web archive files.

#web-crawling #command-line-tools #python-library

Stars: 176

Forks: 33

Last commit: 11 months ago

ArchiveSpark · Scala

An Apache Spark framework for efficient data processing, extraction, and derivation from web archives and archival collections.

#data-lineage #apache-spark #web-archives

Stars: 161

Forks: 19

Last commit: 9 months ago

SolrWayback · Java

A web application for searching, browsing, and analyzing archived web content (ARC/WARC files) with a Solr backend.

#digital-preservation #open-source-archiving #tomcat

Stars: 145

Forks: 28

Last commit: 1 day ago

webarchive-discovery · Java

A toolkit for indexing and exploring web archive content from ARC and WARC files using OpenSearch/Elasticsearch.

#digital-preservation #java #warc-indexing

Stars: 133

Forks: 26

Last commit: 8 months ago

Awesome Memento

A curated list of software, literature, and resources for the Memento protocol (RFC 7089), which enables time-based access to archived web content.

#memento-protocol #web-archive-replay #digital-libraries

Stars: 121

Forks: 12

Last commit: 2 months ago

node-warc · JavaScript

A Node.js library for parsing and creating Web ARChive (WARC) files with support for Chrome, Puppeteer, and Electron.

#stream-processing #web-archives #warc-files

Stars: 105

Forks: 22

Last commit: 1 year ago

Comparison of web archiving software

A compilation of research materials on data resilience, interactivity, and related topics for the Data Together community.

#citizen-science #algorithmic-fairness #open-research

Stars: 100

Forks: 10

Last commit: 7 years ago
