A toolkit for indexing and exploring web archive content from ARC and WARC files using OpenSearch/Elasticsearch.
Web Archive Discovery is a Java-based toolkit for indexing and searching web archive content stored in ARC and WARC files. It extracts data from web archives and indexes it into OpenSearch or Elasticsearch, making archived web pages discoverable through search interfaces. The project solves the challenge of making large-scale web archives searchable and explorable.
It is aimed at digital archivists, librarians, and developers on web archiving projects who need to make archived web content searchable, as well as institutions preserving web archives and researchers analyzing historical web data.
It provides a specialized, open-source solution for indexing web archives with compatibility for both OpenSearch and Elasticsearch. Unlike generic search tools, it understands web archive formats and includes pre-configured schemas optimized for archived content.
Please note that the warc-indexer tool and its code are now supported by NetArchiveSuite. The 'warc-indexer' directory in this repo is kept for reference only. For support and issues with 'warc-indexer', please contact NetArchiveSuite.
Includes a Docker Compose file for running OpenSearch locally, simplifying environment setup and testing for web archive indexing.
Supports both OpenSearch and Elasticsearch with a ported Solr schema, enabling easy migration from legacy systems to modern search engines.
Java-based CLI allows batch indexing of individual WARC files into search clusters, ideal for processing large-scale web archives.
Uses GitHub Actions to run Maven builds automatically, which helps catch build and test failures early.
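Because the CLI indexes individual WARC files, a batch run usually starts by collecting the archive files to process. The following is a minimal sketch of that preparatory step only, not code from this project: the class name `ArchiveFileFinder` and the accepted extensions (`.arc`, `.warc`, and their `.gz` variants) are assumptions for illustration, and the sketch deliberately does not invoke the indexer, since its command-line options are not described here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of the project): finds ARC/WARC files for a batch run.
public class ArchiveFileFinder {

    // Assumed naming convention: .arc or .warc, optionally gzip-compressed (.gz),
    // matched case-insensitively.
    static boolean isArchiveFile(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".gz")) {
            name = name.substring(0, name.length() - 3);
        }
        return name.endsWith(".arc") || name.endsWith(".warc");
    }

    // Walks the directory tree under root and returns matching regular files in sorted order.
    static List<Path> findArchives(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> isArchiveFile(p.getFileName().toString()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Uses the first argument as the root directory, or the current directory if none is given.
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        for (Path archive : findArchives(root)) {
            System.out.println(archive);
        }
    }
}
```

Each printed path could then be handed to the indexer one file at a time, using whatever invocation the project's own documentation specifies.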
The core warc-indexer tool is now maintained by NetArchiveSuite, so the copy in this repository is reference-only and support is split between two projects, which can cause confusion.
Requires setting up and managing OpenSearch or Elasticsearch instances via Docker, adding operational overhead and complexity for small teams.
Built on Java, necessitating JVM installation and expertise, which can be a barrier in environments favoring other programming languages.