Open-Awesome
© 2026 Open-Awesome. Curated for the developer elite.

Browsertrix Crawler

AGPL-3.0 · TypeScript · v1.12.4

A standalone Docker container for high-fidelity, browser-based web archiving crawls using Puppeteer and Brave.

1.0k stars · 136 forks · 0 contributors

What is Browsertrix Crawler?

Browsertrix Crawler is a standalone browser-based high-fidelity crawling system that runs in a single Docker container. It automates web archiving by controlling Brave Browser windows with Puppeteer to capture web content exactly as rendered, solving the problem of inaccurate or incomplete archival crawls. It is designed for complex, customizable archiving tasks where fidelity to the original web experience is critical.

Target Audience

Digital archivists, researchers, librarians, and developers who need to preserve web content with high accuracy for projects like digital libraries, compliance, or historical records.

Value Proposition

Developers choose Browsertrix Crawler because it combines single-container deployment with real-browser capture, avoiding the rendering gaps of traditional crawlers that do not execute pages in a browser. Driving a real browser through Puppeteer lets it record pages as they actually render, making it a reliable open-source tool for professional-grade web archiving.

Overview

Run a high-fidelity browser-based web archiving crawler in a single Docker container

Use Cases

Best For

  • Creating high-fidelity archives of dynamic websites with JavaScript-heavy content
  • Running parallel crawls for large-scale web archiving projects
  • Digital preservation initiatives requiring exact visual and functional capture
  • Research projects that need reproducible, browser-accurate web snapshots
  • Self-hosted web archiving solutions with Docker deployment
  • Archiving complex web applications that rely on real browser rendering

Not Ideal For

  • Scraping static websites without JavaScript rendering
  • Projects requiring real-time data extraction with low latency
  • Environments with limited computational resources (e.g., low RAM or CPU)
  • Teams needing flexibility to use non-Chromium browsers like Firefox or Safari

Pros & Cons

Pros

High-Fidelity Capture

Uses the Chrome DevTools Protocol with real Brave Browser instances to capture web content exactly as rendered, ensuring accuracy for dynamic sites as highlighted in the README.

Parallel Crawling Efficiency

Manages multiple browser windows simultaneously via Puppeteer, enabling scalable archiving for large projects as described in the key features.
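
The idea of several browser windows working through one crawl can be illustrated with a generic worker-pool sketch. This is not Browsertrix Crawler's own code: `fetch_page` is a hypothetical stand-in for a browser window loading and capturing a page, and the queue-based scheduling is an assumption chosen only for illustration.

```python
import asyncio


async def fetch_page(url: str) -> str:
    """Hypothetical stand-in for one browser window loading and capturing a page."""
    await asyncio.sleep(0)  # yield control, as a real page load would
    return f"captured {url}"


async def crawl(urls: list[str], workers: int) -> list[str]:
    """Drain a shared URL queue with a fixed number of concurrent workers."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: list[str] = []

    async def worker() -> None:
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to crawl
            results.append(await fetch_page(url))

    # Each worker plays the role of one browser window.
    await asyncio.gather(*(worker() for _ in range(workers)))
    return results


if __name__ == "__main__":
    pages = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    print(asyncio.run(crawl(pages, workers=2)))
```

Results arrive in completion order rather than input order, so a caller that needs a stable ordering would have to sort or key them by URL.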

Docker Simplicity

Runs as a single Docker container for easy deployment and isolation, simplifying setup and reproducibility across environments.
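
As a rough sketch of what a single-container run might look like, the Python snippet below assembles a `docker run` command. The image name `webrecorder/browsertrix-crawler`, the `crawl` subcommand, the `/crawls/` mount point, and the `--url`, `--collection`, `--workers`, and `--generateWACZ` flags are assumptions taken from the project's public README rather than from this page. The helper name `build_crawl_command` and the example URL are hypothetical. Check the flags against the release you actually run.

```python
import shlex
from pathlib import Path


def build_crawl_command(url: str, collection: str, output_dir: Path, workers: int = 1) -> list[str]:
    """Assemble a docker run argument list for one crawl (hypothetical helper)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [
        "docker", "run", "--rm",
        # Mount a host directory so the archive outlives the container.
        "-v", f"{output_dir.resolve()}:/crawls/",
        "webrecorder/browsertrix-crawler",  # assumed image name
        "crawl",
        "--url", url,
        "--collection", collection,
        "--workers", str(workers),
        "--generateWACZ",  # assumed flag: package the result as a WACZ file
    ]


if __name__ == "__main__":
    cmd = build_crawl_command("https://example.com/", "example-crawl", Path("./crawls"), workers=2)
    # Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps the URL and paths from being reinterpreted by a shell when the list is handed to `subprocess.run`.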

Customizable Configurations

Supports complex crawl setups tailored to specific archiving needs, allowing fine-tuning for different website structures.
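
A crawl setup of this kind is usually kept in a configuration file. The sketch below generates one from Python. The key names `seeds`, `url`, `scopeType`, `workers`, `limit`, and `generateWACZ` are assumptions based on the project's README and are not confirmed by this page; the function names and the example URL are hypothetical. The file is written as JSON, which YAML 1.2 parsers are also expected to accept.

```python
import json
from pathlib import Path


def make_crawl_config(seed_urls: list[str], workers: int = 2, page_limit: int = 100) -> dict:
    """Build a crawl configuration mapping (key names are assumptions, see above)."""
    return {
        # One entry per starting URL; "prefix" is an assumed scope value.
        "seeds": [{"url": url, "scopeType": "prefix"} for url in seed_urls],
        "workers": workers,
        "limit": page_limit,
        "generateWACZ": True,
    }


def write_config(config: dict, path: Path) -> None:
    """Serialize the mapping as indented JSON with a trailing newline."""
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")


if __name__ == "__main__":
    cfg = make_crawl_config(["https://example.com/docs/"])
    write_config(cfg, Path("crawl-config.yaml"))
    print(json.dumps(cfg, indent=2))
```

Keeping the settings in a file rather than on the command line makes a crawl easier to review and repeat. How the file is passed to the crawler depends on the release in use.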

Cons

Resource Intensive

Running multiple browser instances consumes significant memory and CPU, making it unsuitable for low-end hardware or cloud deployments with tight budgets.

Browser Limitation

Tied to Brave Browser through Puppeteer, limiting compatibility for projects that require other browsers or specific browser versions.

Complex Setup and Learning

Requires expertise in Docker, Puppeteer, and archiving parameters, which can be daunting for users without technical backgrounds or for quick, simple tasks.

Quick Stats

Stars: 1,046
Forks: 136
Contributors: 0
Open Issues: 125
Last commit: 5 days ago
Created: 2020

Tags

#puppeteer #digital-preservation #devtools-protocol #crawler #warc #headless-browser #crawling #docker #web-archiving #web-crawler #wacz #automation

Built With

  • Puppeteer
  • Docker
  • Chrome DevTools Protocol

Included in

Web Archiving · 2.5k
Auto-fetched 1 day ago

Related Projects

ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Stars: 27,580
Forks: 1,524
Last commit: 1 day ago

monolith

⬛️ CLI tool and library for saving complete web pages as a single HTML file

Stars: 15,130
Forks: 454
Last commit: 6 days ago

DiskerNet

💾 dn - offline full-text search and archiving for your Chromium-based browser.

Stars: 3,904
Forks: 148
Last commit: 2 months ago

Heritrix

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Stars: 3,228
Forks: 789
Last commit: 6 days ago