A toolkit for indexing and exploring web archive content from ARC and WARC files using OpenSearch/Elasticsearch.
Web Archive Discovery extracts data from ARC and WARC files and indexes it into a search engine such as OpenSearch or Elasticsearch, so that users can search and explore historical web material. Its goal is to make large-scale web archives searchable and accessible for research and archival work.
It is aimed at digital archivists, librarians, researchers, and institutions managing web archives who need to build searchable indexes of archived web content.
Developers choose Web Archive Discovery for its specialized focus on web archive indexing, its integration with modern search engines, and its self-hosted deployment options.
Please note that the warc-indexer tool and its code are now supported by NetArchiveSuite. The 'warc-indexer' directory and code in this repo are kept for reference only. For support and issues relating to 'warc-indexer', please contact NetArchiveSuite.
Tailored specifically to ARC and WARC files, with extraction and indexing geared toward historical web data, in line with its core purpose of making archive contents explorable.
Integrates seamlessly with OpenSearch and Elasticsearch, enabling scalable search capabilities, and includes a Docker Compose setup for easy local development and testing.
Ports Solr schemas to OpenSearch with minimal adjustments, easing migration for users transitioning from Solr-based systems, as noted in the README's compatibility details.
Provides a Java-based CLI tool for batch indexing WARC files, allowing automation and customization in large-scale archive processing workflows.
The core warc-indexer tool is now supported by NetArchiveSuite, so this repository may be out of date for active development, and it may be unclear where issues should be reported.
Requires configuring Docker, OpenSearch/Elasticsearch, and Java environments, which can be daunting for users without prior experience in these technologies.
Relies on a wiki for documentation, which may lack detailed guides or recent updates; the README gives only minimal instructions for advanced use cases.
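Once archive content has been indexed, exploring it amounts to sending search requests to OpenSearch or Elasticsearch. The following is a minimal sketch of such a request in Python, using only the standard library and the `_search` endpoint with a `match` query, both of which exist in OpenSearch and Elasticsearch. The index name `warc-index`, the field name `content`, and the local URL are assumptions made for illustration; they are not taken from the project's schema or documentation.

```python
import json
import urllib.request


def build_search_body(text, field="content", size=10):
    """Build a query DSL body with a single full-text `match` clause.

    `field` is a hypothetical field name; substitute one from your own schema.
    """
    return {"size": size, "query": {"match": {field: text}}}


def search(base_url, index, body):
    """POST `body` to <base_url>/<index>/_search and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    body = build_search_body("climate change")
    print(json.dumps(body, indent=2))
    # Assumed local endpoint and index name; uncomment once a cluster is running.
    # results = search("http://localhost:9200", "warc-index", body)
    # print(results["hits"]["total"])
```

Building the query body in its own function keeps it easy to inspect before anything is sent to the cluster.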