An open-source, extensible, web-scale, archival-quality web crawler from the Internet Archive.
Heritrix is an open-source, extensible web crawler developed by the Internet Archive for large-scale, archival-quality web harvesting. It is designed to capture and preserve digital artifacts with high fidelity while honoring `robots.txt` exclusion directives and META nofollow tags. The project addresses the need for responsible, scalable tools to archive web content for future research and cultural preservation.
It is intended for archivists, researchers, and institutions working on digital preservation, web archiving, or large-scale data collection who need a robust, configurable crawler.
Developers choose Heritrix for its proven scalability in web-scale archiving, its commitment to respectful crawling practices, and its extensible architecture that allows deep customization. As a project from the Internet Archive, it benefits from real-world use in preserving cultural heritage.
Designed to preserve web content with high fidelity, so that captures remain usable for future research.
Engineered for massive, internet-wide crawls, as demonstrated by its use in the Internet Archive's large-scale archiving projects.
Adheres to `robots.txt` exclusion directives and META nofollow tags, a politeness principle stated in the project README.
Modular architecture allows deep customization and integration, supported by comprehensive developer documentation and a REST API.
Setting up and tuning crawl jobs requires detailed knowledge of its many configuration options.
Built for large-scale operations, it demands substantial computational resources and is likely overkill for small crawls.
New users should expect a significant time investment to work through the extensive documentation and the modular design.
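The `robots.txt` compliance mentioned above can be illustrated with a minimal sketch. This is not Heritrix code (Heritrix is a Java application with its own implementation); it uses Python's standard `urllib.robotparser` only to show the kind of check a polite crawler performs before fetching a URL. The `robots.txt` content, the `example-archiver` user agent, the `example.org` URLs, and the helper names `build_parser` and `may_fetch` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


AGENT = "example-archiver"  # hypothetical crawler name
parser = build_parser(ROBOTS_TXT)

# A page outside the disallowed prefix versus one inside it.
allowed = may_fetch(parser, AGENT, "https://example.org/index.html")
blocked = may_fetch(parser, AGENT, "https://example.org/private/report.html")

# Requested pause between requests, in seconds, if the site declares one.
delay = parser.crawl_delay(AGENT)

print(allowed, blocked, delay)
```

A crawler built this way would skip any URL for which `may_fetch` returns `False` and wait at least `delay` seconds between requests to the same host.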