A Node.js library for parsing and creating Web ARChive (WARC) files with support for Chrome, Puppeteer, and Electron.
node-warc is a Node.js library for parsing and creating Web ARChive (WARC) files, the standard format (ISO 28500) for storing captured web content. It provides parsers for existing WARC files and generators that write WARC files from network traffic captured through browser automation tools.
node-warc is aimed at developers building web archiving tools, digital preservation systems, or research crawlers who need to read or write WARC files programmatically in Node.js.
Developers choose node-warc because it offers a consistent API across multiple browser automation tools (Puppeteer, Chrome Remote Interface, Electron) and provides both parsing and creation capabilities in a single, well-documented library.
Parse and create Web ARChive (WARC) files with Node.js
Supports multiple browser automation libraries like Puppeteer, Chrome Remote Interface, and Electron, allowing developers to choose based on project needs without vendor lock-in.
Built-in async iteration (recordIterator) and stream transforms (WARCStreamTransform) enable efficient, memory-friendly handling of large WARC files in Node.js 10+ environments.
Automatically detects and handles both gzipped and uncompressed WARC files with dedicated parsers, and offers environment variable control (NODEWARC_WRITE_GZIPPED) for output compression.
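As an illustration of what gzip auto-detection involves, here is a minimal sketch that uses only the Node.js standard library. It checks a file for the two gzip magic bytes and, if they are present, pipes the file through a gunzip stream. The helper names `looksGzipped` and `openWarcStream` are hypothetical and are not part of node-warc, whose own detection logic may differ.

```javascript
const fs = require('fs')
const zlib = require('zlib')

// Hypothetical helper (not a node-warc API): returns true when the file
// starts with the gzip magic bytes 0x1f 0x8b.
function looksGzipped (filePath) {
  const header = Buffer.alloc(2)
  const fd = fs.openSync(filePath, 'r')
  try {
    const bytesRead = fs.readSync(fd, header, 0, 2, 0)
    return bytesRead === 2 && header[0] === 0x1f && header[1] === 0x8b
  } finally {
    fs.closeSync(fd)
  }
}

// Hypothetical helper (not a node-warc API): returns a readable stream of
// uncompressed WARC bytes, adding a gunzip step only for gzipped input.
function openWarcStream (filePath) {
  const raw = fs.createReadStream(filePath)
  return looksGzipped(filePath) ? raw.pipe(zlib.createGunzip()) : raw
}
```

With a helper like this, a caller can treat `.warc` and `.warc.gz` inputs the same way, because both arrive as a stream of uncompressed bytes.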
Requires integration with third-party browser tools like Puppeteer or Chrome Remote Interface, which adds setup complexity, potential version conflicts, and maintenance overhead.
As a specialized library for web archiving, it has fewer community contributions, plugins, and resources than broader web scraping or data processing frameworks.
The README examples are brief, and advanced error handling or edge cases may require digging into source code or external crawler implementations like Squidwarc.
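To make the streaming model listed among the strengths more concrete, the sketch below shows the general idea of async iteration over WARC records using only the Node.js standard library. It is not node-warc's implementation. The generator name `iterateWarcRecords` and the shape of the yielded objects are assumptions made for illustration. It also assumes an uncompressed stream of `Buffer` chunks that begins at a record boundary.

```javascript
// Hypothetical helper (not a node-warc API): yields one { headers, body }
// object per WARC record, buffering only the bytes not yet consumed.
async function * iterateWarcRecords (source) {
  const HEADER_END = Buffer.from('\r\n\r\n')
  let pending = Buffer.alloc(0)
  for await (const chunk of source) {
    pending = Buffer.concat([pending, chunk])
    while (true) {
      const headerEnd = pending.indexOf(HEADER_END)
      if (headerEnd === -1) break // header block is not complete yet
      const lines = pending.slice(0, headerEnd).toString('utf8').split('\r\n')
      // First line is the version (for example "WARC/1.0"); the rest are
      // "Name: value" fields, stored here under lowercased names.
      const headers = { version: lines[0] }
      for (const line of lines.slice(1)) {
        const sep = line.indexOf(':')
        if (sep > 0) {
          headers[line.slice(0, sep).trim().toLowerCase()] = line.slice(sep + 1).trim()
        }
      }
      const length = Number(headers['content-length'])
      if (!Number.isInteger(length) || length < 0) {
        throw new Error('WARC record has a missing or invalid Content-Length')
      }
      const bodyStart = headerEnd + HEADER_END.length
      // Each record's content block is followed by two CRLF pairs.
      const recordEnd = bodyStart + length + HEADER_END.length
      if (pending.length < recordEnd) break // body is not complete yet
      yield { headers, body: pending.slice(bodyStart, bodyStart + length) }
      pending = pending.slice(recordEnd)
    }
  }
}
```

A caller would consume it with `for await (const record of iterateWarcRecords(stream)) { ... }`. Because records are yielded one at a time, memory use is bounded by the largest single record rather than by the size of the whole file.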