Question 1

JustHTML vs BeautifulSoup for web scraping?

Accepted Answer

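JustHTML is the stronger choice for HTML5 spec compliance and built-in sanitization. BeautifulSoup is more flexible because it can run on several parser backends (html.parser, lxml, html5lib), but its HTML5 correctness depends on the backend you pick: only html5lib follows the HTML5 parsing algorithm, and that combination is slow. Choose JustHTML when correctness and safety matter most, and BeautifulSoup when you are maintaining legacy code already built on it.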

Question 2

How to disable sanitization in JustHTML?

Accepted Answer

Pass sanitize=False when creating a JustHTML object, as shown in the README examples. Do this only for HTML you already trust or have sanitized elsewhere; with sanitization left on, elements or attributes that your queries expect may be removed before you query them.
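A minimal sketch of the call described above. The `JustHTML` import and the `sanitize=False` keyword are taken from this answer and have not been checked against a specific release; `parse_html`, `trusted`, and `parser_cls` are hypothetical names introduced only for illustration.

```python
def parse_html(markup, *, trusted=False, parser_cls=None):
    """Parse markup, skipping sanitization only when the caller marks it trusted.

    parser_cls is a hypothetical hook so the helper can be exercised without
    the library installed; by default it resolves to justhtml.JustHTML.
    """
    if parser_cls is None:
        # Third-party import (assumed package and class name: justhtml.JustHTML).
        from justhtml import JustHTML
        parser_cls = JustHTML

    if trusted:
        # Keyword as named in the answer above; check the README for your version.
        return parser_cls(markup, sanitize=False)

    # Default construction leaves the library's sanitization enabled.
    return parser_cls(markup)
```

Keeping the opt-out behind an explicit `trusted=True` argument makes it harder to disable sanitization by accident for user-supplied markup.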

Question 3

Does JustHTML support XPath queries?

Accepted Answer

No, JustHTML only supports CSS selectors via its query methods. If you need XPath, use a library such as lxml instead, keeping in mind that its parser makes different spec-compliance trade-offs.
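For comparison, here is a small sketch of what the XPath route looks like with lxml. It assumes lxml is installed; `hrefs_via_xpath` is a hypothetical helper name, and the example does not involve JustHTML at all.

```python
def hrefs_via_xpath(markup):
    """Return every link target in an HTML string using an XPath expression."""
    # Third-party dependency: pip install lxml
    from lxml import html as lxml_html

    tree = lxml_html.fromstring(markup)
    # ".//a/@href" selects the href attribute of every <a> below the root element.
    return [str(href) for href in tree.xpath(".//a/@href")]


if __name__ == "__main__":
    page = '<html><body><p><a href="/a">A</a> <a href="/b">B</a></p></body></html>'
    print(hrefs_via_xpath(page))
```

The roughly equivalent CSS selector would be `a[href]`, which returns the elements rather than the attribute values.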

Question 4

Is JustHTML safe for user input?

Accepted Answer

Yes, by default it sanitizes HTML on construction using an allowlist approach similar to Bleach: only explicitly permitted tags and attributes are kept. Still, review the sanitization rules in the documentation to confirm they match your security needs.
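To make the allowlist idea concrete, here is a toy sanitizer built on the standard library's `html.parser`. It is not JustHTML's implementation and its rules are not JustHTML's rules; the tag and attribute sets are assumptions chosen for the example, and it is not safe for production use.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative allowlists (assumptions, not any library's defaults).
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a", "ul", "ol", "li", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://", "mailto:")  # relative URLs are dropped too
DROP_WITH_CONTENT = {"script", "style"}
VOID_TAGS = {"br"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag, keep its text content
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            value = value or ""
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self._skip_depth:
                self._skip_depth -= 1
            return
        if not self._skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    """Return markup with everything outside the allowlists removed."""
    sanitizer = AllowlistSanitizer()
    sanitizer.feed(markup)
    sanitizer.close()
    return "".join(sanitizer.out)
```

The key property of an allowlist is that anything not named is removed, so a tag or attribute the author never thought about is dropped rather than passed through.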

Question 5

How fast is JustHTML compared to lxml?

Accepted Answer

JustHTML is pure Python, so it is slower than lxml, which is C-based and very fast. Among pure-Python HTML5 parsers it is reported to be the fastest, and it is tuned for typical workloads rather than maximum throughput.
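If the difference matters for your workload, it is worth measuring on your own documents. The sketch below is a generic timing harness using only the standard library; `best_time`, `stdlib_parse`, and `SAMPLE` are hypothetical names, and the `html.parser` workload is only a stand-in so the example runs without third-party packages.

```python
import timeit
from html.parser import HTMLParser


def best_time(parse, markup, repeat=5, number=20):
    """Return the best observed time in seconds for one call of parse(markup)."""
    timer = timeit.Timer(lambda: parse(markup))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def stdlib_parse(markup):
    # Stand-in workload: tokenizes the input with the standard-library parser.
    parser = HTMLParser()
    parser.feed(markup)
    parser.close()


# Synthetic document; replace with pages representative of your real input.
SAMPLE = "<ul>" + "".join(
    f"<li><a href='/item/{i}'>Item {i}</a></li>" for i in range(500)
) + "</ul>"

if __name__ == "__main__":
    elapsed = best_time(stdlib_parse, SAMPLE)
    print(f"html.parser: {elapsed * 1e3:.3f} ms per parse")
    # To compare libraries, pass each one's parse entry point as `parse`.
```

Taking the minimum over several repeats reduces noise from other processes, which is the approach the `timeit` documentation recommends.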

Question 6

Can JustHTML handle malformed HTML?

Accepted Answer

Yes, it has browser-grade error recovery and passes 100% of the html5lib-tests for tree construction, so it stays robust on real-world HTML such as snippets with missing or unclosed tags.
