A versatile Ruby web spidering library for crawling sites, domains, or specific links with extensive filtering and callback support.
Spidr is a Ruby library for web spidering and crawling that enables developers to programmatically navigate and extract data from websites. It supports crawling single sites, multiple domains, or specific links with configurable filters and callbacks for handling pages, links, and errors.
Spidr is aimed at Ruby developers who need to build web crawlers for tasks like site mapping, link checking, data extraction, or web content analysis.
Developers choose Spidr for its balance of simplicity and power: it offers extensive crawling features, fine-grained control via callbacks and filters, and a clean Ruby API without the overhead of larger frameworks.
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Supports spidering single sites, multiple domains, or specific links with easy-to-use methods like Spidr.site and Spidr.host, as shown in the extensive examples for targeted crawling.
Offers callbacks for every page, URL, link, and failure, enabling custom logic such as building URL maps or handling errors, demonstrated in examples like every_link for origin-destination tracking.
Allows precise control with blacklisting/whitelisting by scheme, host, port, and extension, plus optional robots.txt support, as seen in examples ignoring specific links or ports.
Handles HTTP/HTTPS, various redirects, basic auth, and cookies, simplifying crawling of protected or redirecting links without extra setup, evidenced by features like cookie-protected link following.
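The callback and filter features described above can be sketched roughly as follows. This is a minimal illustration, not code taken from the project: `track_crawl` and `crawl_site` are hypothetical helper names, the URL passed to `crawl_site` is up to the caller, and the `ignore_links` and `ignore_ports` values are made-up examples. It assumes the `spidr` gem is installed and that `Spidr.site`, `every_link`, and `every_failed_url` behave as the README describes.

```ruby
# Hypothetical helper: registers callbacks on a Spidr agent (or any object
# exposing the same every_link / every_failed_url methods).
# Returns [url_map, failures]; both fill in as the crawl runs.
def track_crawl(agent)
  # destination URL => list of pages that link to it
  url_map  = Hash.new { |hash, key| hash[key] = [] }
  failures = []

  agent.every_link       { |origin, dest| url_map[dest] << origin }
  agent.every_failed_url { |url| failures << url }

  [url_map, failures]
end

# Hypothetical wrapper: crawls one site while skipping some links and ports.
# The filter values below are illustrative assumptions only.
def crawl_site(url)
  require 'spidr' # loaded lazily; assumes the spidr gem is installed

  result = nil
  Spidr.site(url, ignore_links: [/logout/], ignore_ports: [8080]) do |agent|
    result = track_crawl(agent)
  end
  result
end
```

Calling `crawl_site` with a site's root URL should return the link map and the list of failed URLs once the crawl finishes. Keeping the callback wiring in `track_crawl` separate from the network call makes it possible to exercise the bookkeeping with a stub object instead of a live crawl.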
Cannot render JavaScript, making it ineffective for modern websites that rely on client-side rendering or dynamic content loading, as it relies solely on HTML parsing with Nokogiri.
Appears to lack native asynchronous or multi-threaded capabilities: the README does not mention concurrency or parallel processing, so large-scale crawls may be bottlenecked by sequential fetching.
Tied to Ruby and requires specific gems like Nokogiri, limiting use to teams comfortable with Ruby and creating dependency hurdles for cross-language projects.