A high-fidelity, browser-based web archiving library and CLI for capturing single web pages with provenance.
Scoop is a high-fidelity web archiving engine that uses a browser to capture single web pages into standard archival formats such as WARC and WACZ. It records provenance information alongside each capture so that the result can be checked and verified later. The tool is available both as a CLI for terminal use and as a JavaScript library for integration into Node.js projects.
Scoop is aimed at developers, archivists, and researchers who need to programmatically capture and preserve web pages with high fidelity and detailed provenance for digital preservation or analysis.
Developers choose Scoop for its browser-based fidelity, extensive configurability, and built-in support for provenance tracking and WACZ signing, making it a robust alternative to simpler web scrapers for archival-grade captures.
🍨 High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web.
Uses Chromium via Playwright to capture pages exactly as rendered, adhering to a 'no alteration' principle for accurate preservation.
Includes detailed provenance summaries documenting capture context, IP resolution, and timestamps, essential for archival trustworthiness.
Supports WARC, gzipped WARC, and WACZ output, with built-in support for cryptographically signing WACZ files via the WACZ Signing specification.
Can embed screenshots, PDF snapshots, DOM snapshots, extracted videos with subtitles, and SSL certificates directly into captures.
Offers extensive options for timeouts, size limits, window dimensions, and asset inclusions, allowing fine-tuned control over captures.
According to the project's FAQ, Scoop currently cannot capture content behind logins or passwords due to security and isolation concerns.
Requires Node.js 18+ and at least 4GB of RAM, and works best with additional system dependencies such as curl and python3, making setup heavier than simpler scrapers.
Designed for capturing individual pages; bulk or multi-page archiving would require custom scripting or integration, limiting scalability.
Running in headful mode for better captures may need additional setup like xvfb-run on servers, adding operational overhead.
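The single-page limitation noted above means bulk archiving needs a wrapper script. The following is a minimal sketch of one way to do that, not part of Scoop itself: `captureAll`, `CaptureFn`, and `BatchResult` are hypothetical names, and the `captureOne` callback is a stand-in for whatever single-page capture call the caller supplies (for example, a function wrapping the Scoop library). It keeps a bounded number of captures in flight, which matters given the memory cost of each browser instance, and records failures per URL instead of aborting the batch.

```typescript
// Hypothetical batch wrapper: runs a caller-supplied single-page capture
// function over many URLs with a bounded number of captures in flight.

type CaptureFn<T> = (url: string) => Promise<T>;

interface BatchResult<T> {
  url: string;
  ok: boolean;
  value?: T;      // set when the capture succeeded
  error?: string; // set when the capture failed
}

async function captureAll<T>(
  urls: string[],
  captureOne: CaptureFn<T>, // assumption: wraps one single-page capture
  concurrency = 2
): Promise<BatchResult<T>[]> {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError("concurrency must be a positive integer");
  }

  const results: BatchResult<T>[] = new Array(urls.length);
  let next = 0; // index of the next URL to hand out

  // Each worker pulls URLs from the shared counter until none remain.
  async function worker(): Promise<void> {
    while (next < urls.length) {
      const i = next++;
      const url = urls[i];
      try {
        results[i] = { url, ok: true, value: await captureOne(url) };
      } catch (err) {
        // A failed page is recorded and the batch continues.
        const message = err instanceof Error ? err.message : String(err);
        results[i] = { url, ok: false, error: message };
      }
    }
  }

  const workerCount = Math.min(concurrency, urls.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as the input URLs
}
```

Results come back in input order, so a caller can pair each entry with its source URL and retry only the entries where `ok` is false. A low default concurrency is a deliberate choice here, since each capture drives its own browser session.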