A configurable and extensible PHP web spider for crawling and scraping websites with support for breadth-first/depth-first traversal, caching, and custom filters.
PHP-Spider is a web crawling and scraping library for PHP that allows developers to programmatically navigate websites, discover links, and extract structured data. It solves the problem of building reliable, configurable web crawlers for tasks like data collection, link validation, and content monitoring. The library provides fine-grained control over traversal algorithms, resource limits, and filtering logic.
PHP developers who need to build web crawlers, scrapers, or automated data extraction tools for websites. It's particularly useful for those requiring custom traversal logic, caching, or integration with existing PHP applications.
Developers choose PHP-Spider for its extensible architecture, comprehensive feature set, and adherence to PHP standards. Unlike simpler scraping scripts, it offers production-ready components like politeness policies, event systems, and persistence handlers while maintaining flexibility through discoverers and filters.
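A minimal crawl, adapted from the README's simple example, looks roughly like this (the seed URL and XPath expression are placeholders):

```php
<?php

require 'vendor/autoload.php';

use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;

// Seed the spider with a start URI (placeholder URL)
$spider = new Spider('https://example.com/');

// Discover new URIs by following <a> tags inside a specific element
$spider->getDiscovererSet()->set(new XPathExpressionDiscoverer("//div[@id='content']//a"));

// Bound the crawl: follow links at most one level deep, queue at most 10 URIs
$spider->getDiscovererSet()->maxDepth = 1;
$spider->getQueueManager()->maxQueueSize = 10;

// Execute the crawl
$spider->crawl();

// Iterate over the downloaded resources and extract data
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo $resource->getCrawler()->filterXpath('//title')->text() . "\n";
}
```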
Supports both breadth-first and depth-first search algorithms, allowing developers to optimize link discovery based on website structure, as highlighted in the traversal configuration examples.
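As a sketch, the traversal algorithm is selected on the queue manager; this assumes the `QueueManagerInterface` constants and `setQueueManager()` wiring used in the README's complex example, which may differ in older versions:

```php
use VDB\Spider\Spider;
use VDB\Spider\QueueManager\InMemoryQueueManager;
use VDB\Spider\QueueManager\QueueManagerInterface;

$spider = new Spider('https://example.com/'); // placeholder seed URL

// Depth-first digs into one branch of a site at a time;
// breadth-first fans out level by level from the seed
$queueManager = new InMemoryQueueManager();
$queueManager->setTraversalAlgorithm(QueueManagerInterface::ALGORITHM_DEPTH_FIRST);
$spider->setQueueManager($queueManager);
```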
Enables custom URI discovery using XPath expressions, CSS selectors, or PHP logic, making it adaptable to various HTML parsing needs, as shown in the XPathExpressionDiscoverer usage.
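For example, restricting discovery to links inside one container is a one-liner with the XPath discoverer (the XPath expression is a placeholder):

```php
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;

// Only follow the <a> tags found inside a specific <div>
$spider->getDiscovererSet()->set(
    new XPathExpressionDiscoverer("//div[@id='catalog']//a")
);
```

Recent versions also ship a CSS-selector-based discoverer, and fully custom discovery logic can be plugged in by implementing the library's discoverer interface.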
Includes built-in filters for robots.txt compliance and domain limits, plus configurable cache expiration with CachedResourceFilter for efficient incremental crawls, documented in the caching example.
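A sketch of the prefetch filters from the README's complex example; filters reject URIs before they are downloaded. The robots.txt and cache-expiration filters mentioned above follow the same `addFilter()` pattern, but check the caching example for their exact constructors:

```php
use VDB\Spider\Filter\Prefetch\AllowedSchemeFilter;
use VDB\Spider\Filter\Prefetch\AllowedHostsFilter;
use VDB\Spider\Filter\Prefetch\UriWithHashFragmentFilter;
use VDB\Spider\Filter\Prefetch\UriWithQueryStringFilter;

$seed = 'https://example.com/'; // placeholder seed URL
$allowSubDomains = true;

// Only crawl http(s) URIs on the seed's domain, skipping
// fragment-only and query-string links
$spider->getDiscovererSet()->addFilter(new AllowedSchemeFilter(array('http', 'https')));
$spider->getDiscovererSet()->addFilter(new AllowedHostsFilter(array($seed), $allowSubDomains));
$spider->getDiscovererSet()->addFilter(new UriWithHashFragmentFilter());
$spider->getDiscovererSet()->addFilter(new UriWithQueryStringFilter());
```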
Dispatches events throughout the crawl lifecycle, facilitating custom behavior and real-time statistics collection via components like StatsHandler, as demonstrated in the simple example.
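Wiring up the StatsHandler, adapted from the README's example; it subscribes to the three dispatchers that emit events during a crawl:

```php
use VDB\Spider\StatsHandler;

// One subscriber receives events from the spider, queue manager, and downloader
$statsHandler = new StatsHandler();
$spider->getDispatcher()->addSubscriber($statsHandler);
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDownloader()->getDispatcher()->addSubscriber($statsHandler);

$spider->crawl();

// Report what happened during the crawl
echo "Enqueued:  " . count($statsHandler->getQueued()) . PHP_EOL;
echo "Filtered:  " . count($statsHandler->getFiltered()) . PHP_EOL;
echo "Failed:    " . count($statsHandler->getFailed()) . PHP_EOL;
echo "Persisted: " . count($statsHandler->getPersisted()) . PHP_EOL;
```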
Explicitly does not support JavaScript execution, limiting its effectiveness on modern dynamic websites that rely on client-side rendering for content, a stated limitation in the README.
Requires manual setup of discoverers, filters, and handlers, which can be cumbersome for straightforward scraping tasks compared to simpler libraries or scripts, as seen in the multi-step examples.
Stops processing on HTTP 4XX/5XX errors by default, so continuing a crawl past broken links requires custom request handler configuration, a quirk acknowledged in the link checker example.
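One hedged sketch of the workaround: Guzzle itself can be told not to throw on error statuses via its `http_errors` option. How the configured client is injected into the request handler depends on the php-spider version, so treat the `setClient()` call below as an assumption and compare it against the repo's link checker example:

```php
use GuzzleHttp\Client;

// 'http_errors' => false stops Guzzle from throwing on 4XX/5XX,
// so the spider can record the status and keep crawling.
// Assumption: the Guzzle request handler exposes setClient();
// verify against the link checker example for your version.
$spider->getDownloader()->getRequestHandler()->setClient(
    new Client(['http_errors' => false])
);
```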