Open-source self-hosted web archiving tool that saves websites in multiple durable formats like HTML, PDF, and WARC.
ArchiveBox is an open-source, self-hosted web archiving solution that saves websites and online content in multiple durable formats. It takes URLs from browser history, bookmarks, RSS feeds, or services like Pocket and preserves them as HTML, PDF, screenshots, media files, and WARC archives locally. The tool solves the problem of link rot and content disappearance by giving users full control over their personal or organizational web archives.
ArchiveBox is aimed at journalists, researchers, lawyers, and individuals who need to preserve web content for evidence, reference, or personal archiving. It also suits organizations that require compliance archiving and teams building curated collections of online resources.
Unlike cloud-based archiving services, ArchiveBox is entirely self-hosted, ensuring data privacy and long-term accessibility. It uses battle-tested tools like wget and yt-dlp to extract content reliably and stores everything in standard formats that remain usable for decades without proprietary software.
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Preserves snapshots in HTML, PDF, PNG, WARC, and other standard formats, so content is meant to stay accessible for decades without vendor lock-in. The README emphasizes this focus on longevity.
Supports URLs from browser history, RSS feeds, bookmark services like Pocket, and manual imports, making it adaptable to diverse archiving workflows.
Uses tools such as yt-dlp and readability to automatically extract article text, media files, and Git repositories from archived pages.
As a self-hosted application, it eliminates reliance on external services, giving users complete sovereignty over their archived data.
Requires installing multiple dependencies such as Chrome, Node.js, and wget, which can be complex and time-consuming, especially on Windows, where native support is limited.
Archiving processes, particularly for media-rich sites, can be slow and memory-intensive due to the use of headless browsers and multiple extractors.
Key features like the REST API are marked as ALPHA, and advanced JavaScript execution during archiving is still planned, limiting some use cases.
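As a rough illustration of the input step described above (URLs gathered from RSS feeds, bookmark exports, or manual lists), the following Python sketch merges URLs from an RSS document and a plain-text list into one de-duplicated list. It is not ArchiveBox's own code: the function names `urls_from_rss`, `urls_from_text`, and `merge_unique` and the sample feed are hypothetical, and ArchiveBox has its own importers for these sources.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only for illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>A</title><link>https://example.com/a</link></item>
    <item><title>B</title><link> https://example.com/b </link></item>
    <item><title>No link</title></item>
  </channel>
</rss>"""


def urls_from_rss(rss_xml: str) -> list[str]:
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            urls.append(link.strip())
    return urls


def urls_from_text(text: str) -> list[str]:
    """Return the lines that start with http:// or https://, in order."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(("http://", "https://")):
            urls.append(line)
    return urls


def merge_unique(*sources: list[str]) -> list[str]:
    """Concatenate URL lists, keeping only the first occurrence of each URL."""
    seen = set()
    merged = []
    for source in sources:
        for url in source:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    manual_list = "# reading list\nhttps://example.com/b\nhttps://example.org/c\n"
    combined = merge_unique(urls_from_rss(SAMPLE_RSS), urls_from_text(manual_list))
    # One URL per line, a shape that can be saved to a file or piped onward.
    print("\n".join(combined))
```

The one-URL-per-line output could then be handed to ArchiveBox, for example by piping it into `archivebox add`, assuming a working installation with an initialized collection.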