Question 1

How do I set up Web Archive Discovery with Docker for local testing?

Accepted Answer

Navigate to the `warc-indexer/src/main/opensearch/os1` directory, run `docker-compose up -d` to start an OpenSearch server, then initialize the index with the curl commands provided in the README.
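
The index-initialization step boils down to a create-index request against the local server. Here is a minimal Python sketch of an equivalent call; it is not a copy of the README's commands. The base URL `http://localhost:9200`, the index name `warcdiscovery`, and the helper names are assumptions for illustration only.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, schema):
    """Build (but do not send) a PUT request that creates an index."""
    body = json.dumps(schema).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def create_index(base_url, index_name, schema_path):
    """Read a schema file and send the create-index request (needs a running server)."""
    with open(schema_path, encoding="utf-8") as f:
        schema = json.load(f)
    request = build_create_index_request(base_url, index_name, schema)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical usage once the container is up:
# create_index("http://localhost:9200", "warcdiscovery", "schema.json")
```

Keeping request construction separate from sending makes it possible to inspect the URL and body before anything touches the server.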

Question 2

Web Archive Discovery vs NetArchiveSuite: which one should I use?

Accepted Answer

Web Archive Discovery offers reference components and a toolkit, but the warc-indexer is now actively supported in the NetArchiveSuite fork of the project, which is the recommended source for it. Use the original repository if you need the full set of components, or the fork if you only need the maintained indexing tool.

Question 3

Does Web Archive Discovery work with the latest version of Elasticsearch?

Accepted Answer

The README states compatibility with Elasticsearch 7.10.2, and with older versions after minor modifications. It does not mention Elasticsearch 8.x, so expect to test or adapt the setup yourself.
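
One way to see where a given server stands is to read the version it reports at its root endpoint (`GET /`), which returns a JSON document containing `version.number`. The sketch below is an illustrative check, not part of the project; the function names and the `documented_max` default are assumptions derived from the 7.10.2 figure above.

```python
def parse_version(number):
    """Turn a version string such as '7.10.2' into a tuple of ints."""
    core = number.split("-")[0]  # drop suffixes such as '-SNAPSHOT'
    return tuple(int(part) for part in core.split(".")[:3])


def is_documented_version(root_info, documented_max=(7, 10, 2)):
    """Return True if the reported version is at or below documented_max."""
    return parse_version(root_info["version"]["number"]) <= documented_max


# Example with a hand-written response body (a real one has more fields).
sample = {"version": {"number": "8.11.0"}}
if not is_documented_version(sample):
    print("Server is newer than the documented version; test before relying on it.")
```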

Question 4

How can I customize the index schema for my specific web archive needs?

Accepted Answer

Modify the `schema.json` file used during index creation in OpenSearch/Elasticsearch. The README gives little detailed guidance on customization, so check the wiki for more advanced options.
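
As a rough illustration of such a change, the following Python sketch adds one field to a schema before it is sent to the server. It assumes the file follows the usual create-index layout with a `mappings.properties` object, which may not match the project's actual file; the `add_field` helper and the `collection_tag` field are hypothetical.

```python
import copy


def add_field(schema, name, field_type, **options):
    """Return a copy of schema with one extra field mapping added."""
    updated = copy.deepcopy(schema)
    properties = updated.setdefault("mappings", {}).setdefault("properties", {})
    if name in properties:
        raise ValueError(f"field {name!r} is already defined")
    properties[name] = {"type": field_type, **options}
    return updated


# Stand-in for the parsed contents of schema.json (assumed layout).
schema = {"mappings": {"properties": {"url": {"type": "keyword"}}}}
custom = add_field(schema, "collection_tag", "keyword")
```

Working on a copy leaves the original schema untouched, so the stock and customized versions can be compared before an index is created.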

Question 5

What are the performance implications for indexing large WARC files?

Accepted Answer

The indexer is written in Java and relies on the search engine backend for scalability, but the README provides no benchmarks. Performance depends on hardware, archive size, and OpenSearch/Elasticsearch configuration.

Question 6

Can I use Web Archive Discovery without OpenSearch or Elasticsearch?

Accepted Answer

Not with the setup described here, which is built around these search engines. Using a different backend would likely require significant code changes, since the indexing pipeline is tightly coupled to their APIs.

webarchive-discovery