A pure Python HTML5 parser with spec-perfect parsing, built-in sanitization, CSS selectors, and zero dependencies.
JustHTML is a pure Python HTML5 parser that provides spec-compliant parsing with browser-grade error recovery. It solves the problem of parsing and manipulating HTML in Python without requiring C extensions or complex dependencies, offering built-in security features like sanitization and a simple query API with CSS selectors.
It is aimed at Python developers who need to parse, sanitize, query, or transform HTML documents, especially in environments where installing C extensions is difficult or where security and correctness are priorities.
Developers choose JustHTML for its combination of spec-perfect HTML5 compliance, built-in sanitization, zero dependencies, and pure Python implementation, making it easy to install, debug, and use across different platforms while maintaining robust security and correctness.
A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.
Passes 100% of the official html5lib-tests suite, which exercises browser-grade error recovery for malformed HTML, as described in the correctness documentation.
Includes safe-by-default Bleach-style sanitization at construction time, with support for inline CSS rule sanitization, making it secure for user-generated content out of the box.
Provides query() and query_one() methods with familiar CSS selector syntax, simplifying document traversal and manipulation without external dependencies.
Pure Python implementation with no C extensions or system libraries, easy to install and debug across platforms like PyPy and Pyodide, as highlighted in the installation notes.
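The query API described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's documented usage: the import path `justhtml.JustHTML` and the helper name `first_and_all` are not taken from the text above, and only `query()` and `query_one()` are. Check the project's own documentation before relying on it.

```python
def first_and_all(markup, selector):
    """Return (first match, number of matches) for a CSS selector.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so this module still loads where the package is absent.
    # Assumption: `pip install justhtml` provides a `JustHTML` class.
    from justhtml import JustHTML

    # Default construction; per the description, sanitization is on by default.
    doc = JustHTML(markup)

    # query_one() is described as returning a single match for the selector.
    first = doc.query_one(selector)

    # query() is described as returning every match; counting by iteration
    # avoids assuming a particular container type.
    total = sum(1 for _ in doc.query(selector))
    return first, total
```

Calling `first_and_all("<p>a</p><p>b</p>", "p")` would be expected to find two paragraphs, with the first one returned as the single match.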
The project's own benchmarks concede that C and Rust parsers such as html5ever are the better choice for terabyte-scale parsing: JustHTML is optimized for common cases but is slower than native alternatives.
Sanitization runs at construction, before any query, so working with already-trusted HTML requires explicitly opting out with sanitize=False, as noted in the query examples.
Does not support XPath queries or the other advanced tooling found in libraries such as lxml; it focuses solely on CSS selectors and HTML5 compliance.
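The sanitize-before-query behaviour noted above can be seen by counting matches with and without the opt-out. The sketch below assumes the import path `justhtml.JustHTML`; the helper name `count_matches` is hypothetical, and only `query()` and the `sanitize=False` argument come from the description itself.

```python
def count_matches(markup, selector, sanitize=True):
    """Count the nodes matching `selector`, optionally skipping sanitization.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so this module still loads where the package is absent.
    # Assumption: `pip install justhtml` provides a `JustHTML` class.
    from justhtml import JustHTML

    if sanitize:
        # Default construction: the tree is sanitized before it can be queried.
        doc = JustHTML(markup)
    else:
        # Explicit opt-out, intended for input that is already trusted.
        doc = JustHTML(markup, sanitize=False)

    # Count by iteration so no container type is assumed for query().
    return sum(1 for _ in doc.query(selector))
```

Running `count_matches(page, "script")` and `count_matches(page, "script", sanitize=False)` on the same markup should show the difference: elements stripped by the default policy are simply not there to be selected unless sanitization is disabled.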