An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.
An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.
A preconfigured web crawler for backing up websites, producing WARC files with a live dashboard and dynamic ignore patterns.
A browser extension and desktop app for interactive, high-fidelity web archiving directly in the browser.
A standalone Docker container for high-fidelity, browser-based web archiving crawls using Puppeteer and Brave.
A distributed and persistent web archive replay system that uses IPFS to store and serve WARC files.
WarcDB is an SQLite-based file format that makes web crawl data easier to share and query.
A graphical desktop application that simplifies web archiving by providing a one-click interface to preserve and replay web pages using Heritrix and OpenWayback.
A high-fidelity, browser-based web archiving library and CLI for capturing single web pages with provenance.
A high-fidelity, user-scriptable archival web crawler using Chrome/Chromium to preserve JavaScript-rendered content.
An Apache Spark framework for efficient data processing, extraction, and derivation from web archives and archival collections.
A web application for searching, browsing, and analyzing archived web content (ARC/WARC files) with a Solr backend.
A collection of robust and fast Python tools for parsing, extracting, and analyzing web archive data, including a high-performance WARC parser.
A Node.js library for parsing and creating Web ARChive (WARC) files with support for Chrome, Puppeteer, and Electron.
An offline-first web browser that archives, searches, and crawls websites for personal use.