Question 1

What's the easiest web archiving tool for a small project?

Accepted Answer

For small-scale archives, browser extensions such as ArchiveWeb.Page or SingleFile offer user-friendly graphical interfaces. Check the 'Acquisition' section for options marked 'Stable' and suited to single-page saves.

Question 2

Heritrix vs Browsertrix Crawler: which is better for dynamic websites?

Accepted Answer

Browsertrix Crawler drives a real Chromium browser, so it executes JavaScript and captures interactive content, which makes it the better choice for dynamic sites. Heritrix is better suited to large-scale, broad crawls but may miss elements that are rendered client-side.

Question 3

How to convert HTTrack archives to WARC format?

Accepted Answer

Use the httrack2warc tool listed in the 'Utilities' section. It is a Java-based utility designed specifically for this conversion, producing WARC files that can be replayed with standard web archive tools.

Question 4

Where can I find archived news websites for academic research?

Accepted Answer

Explore the 'Public Data' section for sources like the Internet Archive Wayback Machine or Common Crawl. Additionally, tools in 'Search & Discovery' such as SolrWayback can help you search archived collections once they have been indexed.
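As a rough illustration of looking up captures of a news site, the sketch below builds a query for the Wayback Machine CDX API and parses its JSON output. It uses only the Python standard library and makes no network call. The helper names (build_cdx_query, parse_cdx_json, replay_url) and the sample response are made up for this example; check the CDX server documentation before relying on specific parameters.

```python
import json
from urllib.parse import urlencode

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, start=None, end=None, limit=10):
    """Return a CDX query URL listing captures of `url` (hypothetical helper)."""
    params = {"url": url, "output": "json", "limit": str(limit)}
    if start:
        params["from"] = start  # e.g. "2015" or "20150101"
    if end:
        params["to"] = end
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(text):
    """Turn the JSON output (first row = field names) into a list of dicts."""
    rows = json.loads(text)
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]


def replay_url(capture):
    """Build the Wayback replay URL for one parsed capture."""
    return f"https://web.archive.org/web/{capture['timestamp']}/{capture['original']}"


# Made-up sample in the shape the CDX API returns with output=json.
SAMPLE_RESPONSE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/news", "20150102030405", "http://example.com/news",
     "text/html", "200", "ABCDEF", "1234"],
])

captures = parse_cdx_json(SAMPLE_RESPONSE)
print(build_cdx_query("example.com/news", start="2015", end="2016", limit=5))
print(replay_url(captures[0]))
```

To run a real query, pass the URL from build_cdx_query to urllib.request.urlopen and feed the decoded response body to parse_cdx_json. For research use, keep the limit small and space out requests.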

Question 5

Is there a free alternative to Archive-It for self-hosting?

Accepted Answer

Yes, open-source platforms like Browsertrix and Conifer are listed under 'Self-hostable, Open Source' service providers. They give you full control with no vendor fees, though you have to set them up and maintain them yourself.

Question 6

How to get help with WARC file parsing in Python?

Accepted Answer

Refer to the 'WARC I/O Libraries' section for libraries like warcio or FastWARC, which offer documentation and examples. The community Slack channels listed can also provide coding support.
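To make it easier to see what these libraries handle for you, here is a minimal standard-library sketch of how a WARC file is laid out: each record is a version line, a set of named header fields, a blank line, and then exactly Content-Length bytes of content. This is not the warcio or FastWARC API. The function names are hypothetical, and the reader assumes canonical header casing and no folded header lines. For real work, use one of the listed libraries.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in a binary WARC stream.

    Illustrative only: assumes 'Content-Length' is spelled exactly so and
    that header fields are not folded across several lines.
    """
    while True:
        line = stream.readline()
        # Skip the blank lines that separate consecutive records.
        while line in (b"\r\n", b"\n"):
            line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        length = int(headers["Content-Length"])
        body = stream.read(length)
        if len(body) != length:
            raise ValueError("truncated record body")
        yield headers, body


def make_record(warc_type, uri, payload):
    """Build one record for the demo below (hypothetical helper)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Two made-up records held in memory instead of a file on disk.
SAMPLE = (
    make_record("warcinfo", "urn:example:info", b"software: demo\r\n")
    + make_record("response", "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\nhello")
)

for headers, body in iter_warc_records(io.BytesIO(SAMPLE)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```

Most WARC files in practice are gzip-compressed (.warc.gz). Opening one with gzip.open(path, "rb") and passing the result to iter_warc_records should work for this sketch, because Python's gzip module reads concatenated gzip members as a single stream.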

Awesome Web Archiving
