Open-source, self-hosted web archiving tool that saves websites in multiple durable formats such as HTML, PDF, and WARC.
CLI tool and library for saving complete web pages as a single, self-contained HTML file.
A simple, self-hosted bookmark manager with offline archiving, built as a Pocket alternative.
Offline full-text search and archiving tool for Chromium-based browsers that saves and indexes every page you visit.
An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.
A curated list of resources, tools, and services for web archiving, from acquisition and replay to analysis and community.
A privacy-focused web archiving tool with an IM-style interface that captures pages to multiple archival services.
A preconfigured web crawler for backing up websites, producing WARC files with a live dashboard and dynamic ignore patterns.
A browser extension and desktop app for interactive, high-fidelity web archiving directly in the browser.
A Python tool to automatically archive web content (videos, images, social media) from Google Sheets and other sources in a secure, verifiable way.
A standalone Docker container for high-fidelity, browser-based web archiving crawls using Puppeteer and Brave.
A set of Python tools for downloading and preserving wikis, including MediaWiki wikis and Wikimedia projects.
A distributed and persistent web archive replay system that uses IPFS to store and serve WARC files.
A Python package and CLI tool for interacting with the Wayback Machine's Save, CDX, and Availability APIs.
Legacy web archive replay engine for accessing historical web content from WARC files.
A comprehensive curated list of open-source and hosted tools for monitoring and detecting changes on websites.
WarcDB is an SQLite-based file format that makes web crawl data easier to share and query.
A graphical desktop application that simplifies web archiving by providing a one-click interface to preserve and replay web pages using Heritrix and OpenWayback.
A Go package and CLI tool that saves web pages as single HTML files with all assets embedded.
A high-fidelity, browser-based web archiving library and CLI for capturing single web pages with provenance.
A Go tool and library for downloading URLs and files from Common Crawl and Wayback Machine web archives.
A high-fidelity, user-scriptable archival web crawler using Chrome/Chromium to preserve JavaScript-rendered content.
Python command-line tools and libraries for handling, validating, and converting WARC and ARC web archive files.
An Apache Spark framework for efficient data processing, extraction, and derivation from web archives and archival collections.
A web application for searching, browsing, and analyzing archived web content (ARC/WARC files) with a Solr backend.
A toolkit for indexing and exploring web archive content from ARC and WARC files using OpenSearch/Elasticsearch.
A curated list of software, literature, and resources for the Memento protocol (RFC 7089), which enables time-based access to archived web content.
A Node.js library for parsing and creating Web ARChive (WARC) files with support for Chrome, Puppeteer, and Electron.
A compilation of research materials on data resilience, interactivity, and related topics for the Data Together community.
An offline-first web browser that archives, searches, and crawls websites for personal use.
A portable concurrent Memento aggregator CLI and server for retrieving archived web pages from multiple sources.
A Python toolkit for extracting, filtering, and analyzing data from web archives, JSON files, and imageboards.