A high-fidelity, user-scriptable archival web crawler using Chrome/Chromium to preserve JavaScript-rendered content.
Squidwarc is a high-fidelity archival web crawler that uses Chrome or Chromium to capture JavaScript-rendered web content. It solves the problem of traditional crawlers failing to preserve modern, dynamic websites by executing JavaScript and saving content in WARC format. The project is designed to be user-friendly and scriptable, making it suitable for both personal and research archiving.
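For readers unfamiliar with WARC, the following is a minimal sketch of what a single WARC/1.0 `response` record looks like on disk. It illustrates the general file format only and is not Squidwarc's own writer; the function name `warc_response_record` and its parameters are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri, http_bytes, when=None):
    """Serialize one WARC/1.0 'response' record wrapping a raw HTTP response.

    Hypothetical helper for illustration; `when` should be a timezone-aware
    datetime and defaults to the current time in UTC.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", when.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_bytes))),
    ]
    # Version line, named header fields, then a blank line before the block.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Each record is terminated by two CRLF pairs.
    return head.encode("utf-8") + http_bytes + b"\r\n\r\n"
```

The record block is the captured HTTP response, status line and headers included, which is what allows replay tools to serve the page back later as it was received.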
Squidwarc is aimed at digital archivists, researchers, and individuals who need to preserve JavaScript-heavy websites accurately and with little setup. It is particularly valuable for those who find traditional tools like Heritrix too complex or too limited for modern web content.
Developers choose Squidwarc for its ability to handle JavaScript execution out of the box, its scriptability for custom archiving needs, and its straightforward setup compared to enterprise-grade alternatives. It bridges the gap between high-fidelity archiving and accessibility.
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
Uses Chrome/Chromium to fully render dynamic content, ensuring accurate archiving of modern websites, directly addressing Heritrix's limitation of no JavaScript execution.
Allows custom scripts via userFns.js for tailored crawling and data extraction per page, enabling precise control for research or specific archiving needs.
Includes Docker and Docker Compose files for containerized setup, simplifying installation and making it accessible without deep technical expertise.
Supports page-only, same-domain, and all-link crawls with adjustable depth, covering scenarios from a single-page capture to a broader multi-site crawl.
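The three crawl scopes can be pictured with a small sketch. This is only an illustration of how such scoping could work, not Squidwarc's implementation: the mode names, `links_in_scope`, `crawl_frontier`, and the `get_links` callback are all hypothetical, and treating "same-domain" as an exact hostname match is an assumption.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical mode names; Squidwarc's real configuration values may differ.
PAGE_ONLY = "page-only"
SAME_DOMAIN = "same-domain"
ALL_LINKS = "all-links"


def links_in_scope(page_url, hrefs, mode):
    """Return the absolute http(s) links from one page that `mode` would follow."""
    if mode == PAGE_ONLY:
        return []  # capture the page itself and follow nothing
    page_host = urlparse(page_url).hostname
    followed = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if mode == SAME_DOMAIN and parsed.hostname != page_host:
            continue  # assumption: same-domain means identical hostname
        if absolute not in followed:
            followed.append(absolute)
    return followed


def crawl_frontier(seed, get_links, mode, depth):
    """Expand breadth-first up to `depth` hops from `seed`.

    `get_links(url)` is a caller-supplied function returning the hrefs on a page.
    """
    seen = [seed]
    current = [seed]
    for _ in range(depth):
        next_level = []
        for url in current:
            for link in links_in_scope(url, get_links(url), mode):
                if link not in seen:
                    seen.append(link)
                    next_level.append(link)
        current = next_level
    return seen
```

With a depth of 1, a same-domain crawl would visit the seed plus the pages it links to on the same host, while an all-links crawl would also include off-site targets.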
Relies on full Chrome instances, which consume significant memory and CPU, making it slower and less suitable for large-scale or high-frequency crawls.
The README explicitly states that Squidwarc does not aim to dethrone Heritrix for extensive archival projects, which suggests potential bottlenecks in distributed or enterprise environments.
Requires Node.js or Docker to operate; where neither is available, the README points users to alternative tools such as WARCreate.
Squidwarc is an open-source alternative to the following products: