A Go package for querying HTML documents using XPath expressions with built-in caching for performance.
htmlquery is an XPath query package for HTML in Go that enables developers to extract data or evaluate expressions from HTML documents. It provides a convenient way to navigate and query HTML structures using standard XPath 1.0/2.0 syntax, making it useful for web scraping and data extraction tasks. The package includes built-in LRU-based caching of compiled XPath expressions to improve query performance.
htmlquery is aimed at Go developers who need to scrape web pages, extract data, or parse HTML, particularly those who already know XPath or prefer it for querying document structures.
Developers choose htmlquery for its fast, efficient, and easy-to-use XPath implementation with intelligent caching to avoid re-compilation, a clean API, and support for loading HTML from URLs, files, or strings. It is part of a suite of XPath query packages (including xmlquery and jsonquery) by the same author, offering consistency across data formats.
htmlquery is a Go XPath package for querying HTML.
Implements LRU-based caching of compiled XPath expressions. The project's benchmarks report 55.2 ns/op with the cache enabled versus 3162 ns/op with it disabled, because repeated queries skip re-compilation.
Supports loading HTML from URLs with LoadURL, from local files with LoadDoc, and from strings or any other io.Reader with Parse, which covers the usual data sources in scraping workflows.
Built on antchfx/xpath, it supports XPath 1.0/2.0 syntax for complex queries, including expression evaluation such as count(//img) and attribute selection.
Offers both Find (panics on invalid XPath) and QueryAll (returns an error), so developers can choose the variant that matches their error-handling style.
Requires familiarity with XPath syntax, which can be more complex and less intuitive than CSS selectors for developers new to it, potentially slowing down onboarding.
Caching is enabled by default using an LRU cache, which adds some memory overhead in long-running applications; disabling it forces every query to re-compile its expression, which the benchmarks show is significantly slower.
As a niche package focused on XPath, it has fewer community extensions, tutorials, and third-party tools compared to popular alternatives like goquery.
Does not execute JavaScript, so it only sees the static HTML it is given. Pages that render content client-side require additional tools, which limits its use for scraping modern, script-heavy sites.