Question 1

How to set up webarchive-discovery locally with Docker?

Accepted Answer

Start OpenSearch with the docker-compose file provided in warc-indexer/src/main/opensearch/os1, then initialize the index with the curl commands given in the README. Once the index exists, you can index WARC files into it.
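Before running the index-initialization commands, it can help to confirm that the container is actually accepting connections. The following is a minimal sketch of such a check, not part of the toolkit. The helper names `http_probe` and `wait_until_ready` are hypothetical, and the sketch assumes the service answers plain HTTP without authentication; if the compose file enables security, the probe would need adjusting.

```python
import time
import urllib.error
import urllib.request


def http_probe(url):
    """Return True if the URL answers with HTTP 200 (assumes no auth is required)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(url, attempts=30, delay=2.0, probe=http_probe, sleep=time.sleep):
    """Poll `url` until `probe` succeeds.

    Returns True as soon as a probe succeeds, or False after `attempts` tries.
    """
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A call might look like `wait_until_ready("http://localhost:9200")`. Port 9200 is OpenSearch's default HTTP port and is assumed here; check the compose file for the port it actually publishes.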

Question 2

Webarchive-discovery vs NetArchiveSuite: which is better for WARC indexing?

Accepted Answer

Webarchive-discovery targets OpenSearch/Elasticsearch, using a schema ported from Solr, while the warc-indexer tool itself is now maintained by NetArchiveSuite. Choose based on which search engine you prefer and how much you depend on active support.

Question 3

Can webarchive-discovery work with the latest Elasticsearch versions?

Accepted Answer

The README states compatibility with Elasticsearch 7.10.2; older versions may also work. Elasticsearch 8.x or newer may require significant modifications because of API changes.

Question 4

How to batch index multiple WARC files using this tool?

Accepted Answer

The command-line tool indexes individual files. For batch processing, drive it from a shell script or integrate it into a pipeline; it is designed for large-scale, non-real-time archiving workflows.
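One way to script the batch step is sketched below, under stated assumptions. It walks a directory, picks out ARC/WARC files by suffix, and runs a caller-supplied indexer command once per file. The names `find_archives` and `index_all` are hypothetical, and the indexer command itself is deliberately not filled in: pass whatever invocation the README gives, split into a list of arguments.

```python
import subprocess
from pathlib import Path

# File suffixes treated as web archive files in this sketch.
ARCHIVE_SUFFIXES = (".warc", ".warc.gz", ".arc", ".arc.gz")


def find_archives(root):
    """Return all ARC/WARC files under `root`, sorted for a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.name.lower().endswith(ARCHIVE_SUFFIXES)
    )


def index_all(root, base_cmd, runner=subprocess.run):
    """Run `base_cmd + [file]` once per archive and return the files that failed.

    `base_cmd` is the indexer invocation as a list of arguments (an assumption:
    the tool is taken to accept the archive path as its final argument).
    """
    failed = []
    for path in find_archives(root):
        result = runner(list(base_cmd) + [str(path)])
        if result.returncode != 0:
            failed.append(path)
    return failed
```

Collecting failures instead of stopping at the first error suits long, unattended runs: the returned list can be retried or logged once the batch finishes.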

Question 5

What are the key differences between ARC and WARC support in webarchive-discovery?

Accepted Answer

The tool handles both formats in the same way: it extracts record content and indexes it into schema fields designed for web archive data. Format-specific differences are abstracted away during indexing.

Question 6

Is webarchive-discovery suitable for real-time web archiving?

Accepted Answer

No. It is built for batch processing of archived files. Real-time indexing would require a custom streaming implementation, which this toolkit does not natively support.
