An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.
Heritrix is an open-source, extensible web crawler developed by the Internet Archive for archival-quality, large-scale web harvesting. It is designed to respect website politeness policies like `robots.txt` and capture web content faithfully for long-term preservation. The project addresses the need for a reliable, scalable tool to archive digital culture for future researchers and generations.
Heritrix suits institutions, researchers, and developers working on web archiving, digital preservation, and large-scale data collection, such as libraries, archives, and academic projects.
Developers choose Heritrix for its proven scalability, archival-quality focus, and built-in respect for website policies, backed by the Internet Archive's expertise. Its extensible architecture allows customization for specific preservation needs, making it a trusted tool in the digital heritage community.
Designed to capture complete, faithful representations of web content for long-term preservation, as emphasized in its philosophy of treating the web as cultural heritage.
Engineered for web-scale harvesting and able to handle crawls of the size run by institutions like the Internet Archive.
Offers a modular architecture with detailed developer documentation and a REST API, so new components can be added for specific preservation needs.
Respects robots.txt and META nofollow tags by default, promoting ethical crawling practices with configurable politeness settings, as stressed in the README.
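To make the politeness idea concrete, here is a minimal Python sketch of a `robots.txt` check using the standard library's `urllib.robotparser`. It is not Heritrix code and does not reflect how Heritrix implements the check internally; the user-agent name `example-archiver`, the `example.org` URLs, and the sample rules are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would fetch this from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_body: str) -> RobotFileParser:
    """Parse a robots.txt body that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


AGENT = "example-archiver"  # assumed user-agent name, not a Heritrix default
parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before fetching each URL.
allowed = parser.can_fetch(AGENT, "https://example.org/index.html")
blocked = parser.can_fetch(AGENT, "https://example.org/private/report.html")

# Crawl-delay, when present, suggests a minimum pause between requests.
delay = parser.crawl_delay(AGENT)
```

In this sketch the crawler would fetch the first URL, skip the second, and wait at least `delay` seconds between requests to the same host.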
Requires detailed job configuration via XML or properties files, which can overwhelm users unfamiliar with Java-based systems even with the documentation available.
Because it is Java-based, Heritrix requires installing and managing a JVM, which adds resource and setup overhead compared with lighter, language-agnostic alternatives.
Out of the box, it primarily captures static HTML and struggles with JavaScript-rendered content unless paired with additional tools such as headless browsers.
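The gap between static HTML and JavaScript-rendered content can be shown with a small Python sketch. It uses the standard library's `html.parser` to pull links from markup without running any scripts. This is an illustration of the general problem only, not Heritrix's extractor logic; the `LinkExtractor` class and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags found in the raw markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page: one link is in the markup, one is created by a script.
PAGE = """\
<html><body>
<a href="/about">About</a>
<script>
  var a = document.createElement("a");
  a.href = "/gallery";
  document.body.appendChild(a);
</script>
</body></html>
"""

extractor = LinkExtractor()
extractor.feed(PAGE)

# Only the link present in the static markup is discovered.
static_links = extractor.links
```

Here `static_links` contains `/about` but not `/gallery`, because the second link exists only after a browser executes the script. Capturing it would need a rendering step, for example a headless browser.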