Error-recovering streaming HTML5 and XML parsers for OCaml with lazy, non-blocking, and one-pass processing.
Markup.ml is a pair of OCaml parsers that implement the HTML5 and XML specifications with error recovery. It provides streaming, lazy, and non-blocking parsing capabilities, allowing developers to process web content efficiently without building in-memory document representations. The parsers automatically detect character encodings and output UTF-8, making them suitable for web scraping and document processing tasks.
Markup.ml is aimed at OCaml developers working with web content, such as those building web scrapers, document processors, or other tools that need to parse HTML or XML with high conformance and error tolerance.
Developers choose Markup.ml for its spec-compliant error recovery, streaming capabilities, and minimal memory footprint. Unlike many parsers that build DOM trees, it emits signals lazily and supports synchronous use as well as asynchronous workflows through Lwt integration.
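As a rough illustration of the lazy signal stream described above, the following sketch pulls signals one at a time with `Markup.next`. It assumes the `markup` package is installed and linked; the input string and the binding names (`signals`, `first`, `second`) are made up for this example.

```ocaml
(* Minimal sketch: pull signals on demand from the HTML parser.
   Assumes the opam package `markup` is available. *)
let signals =
  Markup.string "<p>Hello <b>world</b></p>"
  |> Markup.parse_html
  |> Markup.signals

(* Each call to Markup.next asks the parser for one more signal;
   the rest of the input is left for later requests. *)
let first = Markup.next signals
let second = Markup.next signals

let () =
  List.iter
    (function
      | Some s -> print_endline (Markup.signal_to_string s)
      | None -> print_endline "(end of stream)")
    [first; second]
```

Because the stream is consumed only as far as the caller asks, a program that needs just the first few signals of a large document can stop early.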
Automatically corrects malformed HTML/XML while reporting errors, as shown in the README example where it recovers from unmatched tags like <em>.
Processes partial input as it arrives and parses only when signals are requested, enabling efficient handling of large or real-time data streams.
Supports non-blocking I/O through Lwt, with dedicated modules for concurrent use, making it suitable for OCaml web services and scraping.
Operates in one pass without buffering the entire document, limiting memory usage to a small lookahead, which is ideal for large files.
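The error recovery and reporting mentioned above can be sketched as follows. This is an illustrative example rather than the README's own code: it assumes the `markup` package is installed, and the malformed input and the `errors` counter are invented for the demonstration.

```ocaml
(* Minimal sketch: recover from malformed HTML while counting reported errors.
   Assumes the opam package `markup` is available. *)
let errors = ref 0

(* Called by the parser for each recoverable error it encounters. *)
let report location error =
  incr errors;
  prerr_endline (Markup.Error.to_string ~location error)

(* The <em> element is never closed; the parser repairs the structure
   and the serializer writes out well-formed markup. *)
let html =
  Markup.string "<p><em>Unclosed emphasis<p>Next paragraph"
  |> Markup.parse_html ~report
  |> Markup.signals
  |> Markup.write_html
  |> Markup.to_string

let () =
  print_endline html;
  Printf.eprintf "%d error(s) reported\n" !errors
```

The `report` callback runs as parsing advances, so it can also raise an exception to abort early, for example after a fixed number of errors.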
Does not construct a document tree, requiring additional libraries like Lambda Soup for common tasks such as element selection or manipulation.
The stream-based functional interface and OCaml-specific patterns, such as lazy signals, can be challenging for developers new to this paradigm.
The library is still at version 0.x.x, so under semantic versioning breaking changes may occur between minor releases, which could affect production code.
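On the missing document tree noted above: for simple cases, a small tree can be assembled from the signal stream without another library. The sketch below assumes a Markup.ml release that provides `Markup.tree`; the `node` type and the `count_elements` helper are hypothetical names introduced only for this example.

```ocaml
(* Minimal sketch: fold the signal stream into a user-defined tree.
   Assumes the opam package `markup` is available. *)

(* Hypothetical tree type, for illustration only. *)
type node =
  | Text of string
  | Element of string * node list

let tree : node option =
  Markup.string "<p><em>Markup.ml</em> rocks!</p>"
  |> Markup.parse_html
  |> Markup.signals
  |> Markup.tree
       ~text:(fun ss -> Text (String.concat "" ss))
       ~element:(fun (_ns, name) _attrs children -> Element (name, children))

(* Count the element nodes in the tree. *)
let rec count_elements = function
  | Text _ -> 0
  | Element (_, children) ->
    1 + List.fold_left (fun n child -> n + count_elements child) 0 children

let () =
  match tree with
  | Some t -> Printf.printf "%d element(s)\n" (count_elements t)
  | None -> print_endline "no root element"
```

This gives up the constant-memory property, since the whole tree is held in memory, so it is best reserved for documents known to be small. For CSS-style selection, a library such as Lambda Soup remains the more practical choice.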