A preconfigured web crawler for backing up websites, producing WARC files with a live dashboard and dynamic ignore patterns.
grab-site is a web crawler specifically designed for archiving websites. It takes a URL and recursively downloads the site, saving the content in WARC files, the standard format for web archives. It solves the problem of efficiently and completely backing up dynamic websites while providing tools to avoid common pitfalls like infinite bot traps.
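A minimal invocation might look like the following sketch (the URL is illustrative; grab-site creates a new crawl directory containing the WARC output and control files):

```shell
# Start a recursive crawl of a site, writing WARC files
# into a per-crawl directory under the current directory.
grab-site 'https://example.com/'
```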
Digital archivists, researchers, and developers who need to preserve websites for historical, legal, or research purposes. It's also useful for anyone backing up large or complex sites where standard tools like wget are insufficient.
Developers choose grab-site for its archivist-focused features like the live dashboard, dynamic ignore patterns, and duplicate detection, which provide greater control and reliability than generic crawlers. Its integration with wpull ensures robust, disk-efficient crawling suitable for very large sites.
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
The built-in dashboard provides real-time visibility into all active crawls, showing queued URLs, progress, and status; as the usage section notes, starting gs-server exposes this monitoring through a web interface.
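Monitoring is a two-step sketch, assuming the default dashboard address from the README:

```shell
# Start the dashboard server, then open the monitoring page in a browser.
gs-server
# Dashboard is served at http://127.0.0.1:29000/ by default;
# all crawls started afterwards report their status there.
```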
Users can add or modify ignore rules during a crawl by editing the DIR/ignores file, allowing real-time skipping of junk URLs to prevent infinite traps, as described in the 'Changing ignores during the crawl' section.
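Since the ignores file is plain text with one regular expression per line, adding a rule mid-crawl is just an append. A hedged sketch, where `mysite-crawl` stands in for the real crawl directory that grab-site created:

```shell
# Append an ignore regex to a running crawl's ignores file;
# grab-site picks up changes to this file while the crawl runs.
DIR=mysite-crawl
mkdir -p "$DIR"   # normally created by grab-site itself
echo '^https?://example\.com/calendar/' >> "$DIR/ignores"
```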
Includes extensively tested default ignore sets for common site types like forums, Reddit, and MediaWiki, reducing manual configuration effort, with sets available in the libgrabsite/ignore_sets directory.
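Default ignore sets are selected at crawl start with the `--igsets` option; for example, combining the forums and reddit sets (the target URL is illustrative):

```shell
# Apply multiple built-in ignore sets, comma-separated.
grab-site --igsets=forums,reddit 'https://forum.example.com/'
```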
Stores the URL queue on disk instead of memory, enabling crawls of sites with up to ~10 million pages, as noted in the README's key features and philosophy for handling large archives.
Installation requires specific Python versions (3.7 or 3.8) and numerous dependencies, with non-trivial platform-specific steps, as detailed in the lengthy installation instructions for Ubuntu, macOS, and other platforms.
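On a supported platform, the install boils down to a dedicated virtualenv plus a pip install; a sketch along the lines of the README (paths and Python version are illustrative):

```shell
# Create an isolated environment and install grab-site from PyPI.
python3.8 -m venv ~/gs-venv
~/gs-venv/bin/pip install --no-binary lxml grab-site
```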
grab-site ignores robots.txt files by design, which can lead to IP bans and abuse complaints, as warned in the README's warnings section, requiring users to handle ethical and legal risks manually.
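One mitigation is to crawl politely even though robots.txt is ignored; a hedged sketch using grab-site's rate-limiting options (flag values are illustrative):

```shell
# Reduce load on the target: one connection at a time,
# with a delay between requests (in milliseconds).
grab-site --concurrency=1 --delay 1000 'https://example.com/'
```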
For many websites like forums or non-English MediaWiki sites, users must add custom ignore patterns based on the provided tips, increasing the manual effort and expertise required for effective crawling.