A preconfigured web crawler for backing up websites, producing WARC files with a live dashboard and dynamic ignore patterns.
grab-site is a web crawler specifically designed for archiving websites. It takes a URL and recursively downloads the site, saving the content in WARC files, the standard format for web archives. It solves the problem of efficiently and completely backing up dynamic websites while providing tools to avoid common pitfalls like infinite bot traps.
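A minimal invocation might look like the following sketch (the URL is illustrative; grab-site creates a new crawl directory containing the WARC output and control files):

```shell
# Start a recursive crawl of a site, writing WARC files
# into a per-crawl directory under the current directory.
grab-site 'https://example.com/'
```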
Digital archivists, researchers, and developers who need to preserve websites for historical, legal, or research purposes. It's also useful for anyone backing up large or complex sites where standard tools like wget are insufficient.
Developers choose grab-site for its archivist-focused features like the live dashboard, dynamic ignore patterns, and duplicate detection, which provide greater control and reliability than generic crawlers. Its integration with wpull ensures robust, disk-efficient crawling suitable for very large sites.
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
The built-in dashboard provides real-time visibility into all active crawls, showing queued URLs, progress, and status; as the usage section notes, starting gs-server exposes this monitoring through a web interface.
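Monitoring is a two-step sketch, assuming the default dashboard address from the README:

```shell
# Start the dashboard server, then open the monitoring page in a browser.
gs-server
# Dashboard is served at http://127.0.0.1:29000/ by default;
# all crawls started afterwards report their status there.
```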
Users can add or modify ignore rules during a crawl by editing the DIR/ignores file, allowing real-time skipping of junk URLs to prevent infinite traps, as described in the 'Changing ignores during the crawl' section.
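Since the ignores file is plain text with one regular expression per line, adding a rule mid-crawl is just an append. A hedged sketch, where `mysite-crawl` stands in for the real crawl directory that grab-site created:

```shell
# Append an ignore regex to a running crawl's ignores file;
# grab-site picks up changes to this file while the crawl runs.
DIR=mysite-crawl
mkdir -p "$DIR"   # normally created by grab-site itself
echo '^https?://example\.com/calendar/' >> "$DIR/ignores"
```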
Includes extensively tested default ignore sets for common site types like forums, Reddit, and MediaWiki, reducing manual configuration effort, with sets available in the libgrabsite/ignore_sets directory.
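Default ignore sets are selected at crawl start with the `--igsets` option; for example, combining the forums and reddit sets (the target URL is illustrative):

```shell
# Apply multiple built-in ignore sets, comma-separated.
grab-site --igsets=forums,reddit 'https://forum.example.com/'
```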
Stores the URL queue on disk instead of memory, enabling crawls of sites with up to ~10 million pages, as noted in the README's key features and philosophy for handling large archives.
Installation requires specific Python versions (3.7 or 3.8) and numerous dependencies, with non-trivial platform-specific steps, as detailed in the lengthy installation instructions for Ubuntu, macOS, and other platforms.
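On a supported platform, the install boils down to a dedicated virtualenv plus a pip install; a sketch along the lines of the README (paths and Python version are illustrative):

```shell
# Create an isolated environment and install grab-site from PyPI.
python3.8 -m venv ~/gs-venv
~/gs-venv/bin/pip install --no-binary lxml grab-site
```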
grab-site ignores robots.txt files by design, which can lead to IP bans and abuse complaints, as warned in the README's warnings section, requiring users to handle ethical and legal risks manually.
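One mitigation is to crawl politely even though robots.txt is ignored; a hedged sketch using grab-site's rate-limiting options (flag values are illustrative):

```shell
# Reduce load on the target: one connection at a time,
# with a delay between requests (in milliseconds).
grab-site --concurrency=1 --delay 1000 'https://example.com/'
```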
For many websites like forums or non-English MediaWiki sites, users must add custom ignore patterns based on the provided tips, increasing the manual effort and expertise required for effective crawling.