A standalone Docker container for high-fidelity, browser-based web archiving crawls using Puppeteer and Brave.
Browsertrix Crawler is a standalone, browser-based, high-fidelity crawling system that runs in a single Docker container. It automates web archiving by controlling Brave Browser windows with Puppeteer and capturing web content as the browser renders it, which addresses the inaccurate or incomplete captures that crawlers without a browser often produce on dynamic pages. It is designed for complex, customizable archiving tasks where fidelity to the original web experience is critical.
It is aimed at digital archivists, researchers, librarians, and developers who need to preserve web content accurately for projects such as digital libraries, compliance records, or historical archives.
Developers choose Browsertrix Crawler because it combines the simplicity of a single container with the precision of a real browser, avoiding the limitations of traditional crawlers that do not render pages. Because pages are loaded in real browser windows driven by Puppeteer, dynamic content is archived as rendered, which makes it a practical open-source option for professional web archiving.
Run a high-fidelity browser-based web archiving crawler in a single Docker container
Uses the Chrome DevTools Protocol with real Brave Browser instances to capture web content exactly as rendered, ensuring accuracy for dynamic sites as highlighted in the README.
Manages multiple browser windows in parallel via Puppeteer, so crawls can scale to larger projects.
Runs as a single Docker container for easy deployment and isolation, simplifying setup and reproducibility across environments.
Supports complex crawl setups tailored to specific archiving needs, allowing fine-tuning for different website structures.
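As a rough illustration of the features above, the following Python sketch assembles a `docker run` command for a crawl without executing it. The image name `webrecorder/browsertrix-crawler` and the flags `--url`, `--workers`, `--collection`, and `--generateWACZ` do not appear in this description; they are assumptions based on the project's command-line interface and should be checked against the current README. `build_crawl_command` and its parameters are hypothetical names introduced for this example.

```python
import shlex
from pathlib import Path


def build_crawl_command(url, collection, workers=2, output_dir="crawls"):
    """Return the argv list for a single-container crawl (nothing is executed)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    host_dir = Path(output_dir).resolve()
    return [
        "docker", "run", "--rm",
        # Mount a host directory so the archive outlives the container.
        "-v", f"{host_dir}:/crawls/",
        "webrecorder/browsertrix-crawler",  # assumed image name
        "crawl",
        "--url", url,
        "--workers", str(workers),   # assumed flag: parallel browser windows
        "--collection", collection,  # assumed flag: name of the output collection
        "--generateWACZ",            # assumed flag: package the result as WACZ
    ]


if __name__ == "__main__":
    cmd = build_crawl_command("https://example.com/", "example", workers=4)
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

To actually start a crawl, the returned list could be passed to `subprocess.run(cmd, check=True)` on a machine with Docker installed. Building the command as a list rather than a string avoids shell-quoting problems with URLs.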
Running multiple browser instances consumes significant memory and CPU, making it unsuitable for low-end hardware or cloud deployments with tight budgets.
Tied to Brave Browser through Puppeteer, limiting compatibility for projects that require other browsers or specific browser versions.
Requires expertise in Docker, Puppeteer, and archiving parameters, which can be daunting for users without technical backgrounds or for quick, simple tasks.
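One way to contain the resource cost noted above is to cap the container and reduce the number of parallel browser windows to match. The sketch below relies on Docker's `--memory` and `--cpus` options for `docker run`; the per-window and reserved memory figures are illustrative placeholders, not measurements, and `max_workers_for` and `docker_resource_flags` are hypothetical helpers.

```python
def max_workers_for(memory_mb, per_window_mb=1024, reserved_mb=512):
    """Estimate how many browser windows fit in a memory budget.

    per_window_mb and reserved_mb are placeholder assumptions; measure real
    usage on the target sites before relying on them.
    """
    usable = memory_mb - reserved_mb
    # Always allow at least one window, even on a very small budget.
    return max(1, usable // per_window_mb)


def docker_resource_flags(memory_mb, cpus):
    """Return `docker run` options that cap container memory and CPU."""
    return ["--memory", f"{memory_mb}m", "--cpus", str(cpus)]


if __name__ == "__main__":
    budget_mb = 4096
    print(max_workers_for(budget_mb))           # 3 with the placeholder figures
    print(docker_resource_flags(budget_mb, 2))  # ['--memory', '4096m', '--cpus', '2']
```

The returned options belong before the image name in a `docker run` invocation, and the worker estimate is only a starting point to adjust after observing a real crawl.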
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
⬛️ CLI tool and library for saving complete web pages as a single HTML file
💾 dn - offline full-text search and archiving for your Chromium-based browser.
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.