A Unicode-aware lexer generator for OCaml that embeds lexer specifications directly in OCaml source files.
sedlex is a lexer generator for OCaml designed specifically for Unicode text processing. It allows developers to write lexers using OCaml's native pattern matching syntax directly within source files, supporting full Unicode character sets and arbitrary input encodings.
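As a minimal sketch of what an embedded lexer looks like, the following assumes the sedlex library and its PPX (sedlex.ppx) are available. The token type and the function names next_token and tokenize are hypothetical, chosen only for illustration; the match%sedlex form, the [%sedlex.regexp? ...] extension, and the Sedlexing.Utf8 functions come from sedlex itself.

```ocaml
(* Named regular expressions, reusable across rules. *)
let digit = [%sedlex.regexp? '0' .. '9']
let number = [%sedlex.regexp? Plus digit]

(* xid_start and xid_continue are predefined Unicode classes. *)
let ident = [%sedlex.regexp? xid_start, Star xid_continue]

(* Hypothetical token type, for illustration only. *)
type token = Int of int | Ident of string

(* Return the next token, or None at end of input. *)
let rec next_token buf =
  match%sedlex buf with
  | white_space -> next_token buf
  | number -> Some (Int (int_of_string (Sedlexing.Utf8.lexeme buf)))
  | ident -> Some (Ident (Sedlexing.Utf8.lexeme buf))
  | eof -> None
  | _ -> failwith "unexpected character"

(* Decode a UTF-8 string and collect all tokens in order. *)
let tokenize s =
  let buf = Sedlexing.Utf8.from_string s in
  let rec loop acc =
    match next_token buf with
    | Some t -> loop (t :: acc)
    | None -> List.rev acc
  in
  loop []
```

Because the rules are ordinary match expressions, the lexer lives next to the code that uses it and is type-checked along with the rest of the module.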
OCaml developers building parsers or compilers that need to process Unicode text, especially those working with internationalization, multilingual data, or formal language tools.
Developers choose sedlex over ocamllex for its robust Unicode support and seamless integration with OCaml's syntax and tooling. Its PPX-based implementation avoids separate lexer definition files and works with standard OCaml editors and build systems.
An OCaml lexer generator for Unicode
Handles Unicode code points and multiple encodings such as UTF-8 and UTF-16, which suits international text processing; ocamllex, by contrast, is byte-based.
Lexer rules are written as match%sedlex expressions inside regular OCaml source files, so no separate .mll files are needed and editors handle the code without support for a custom syntax.
Implemented as a PPX rewriter, it integrates smoothly with modern OCaml build systems like dune and works with standard tooling and editors.
Supports named regexps for reusability, sub-match capture with 'as' patterns, and predefined Unicode categories for complex lexer rules.
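For the dune integration mentioned above, a build stanza might look like the following sketch. The executable name main is a placeholder; the library name sedlex and the preprocessor name sedlex.ppx are the ones the package provides.

```dune
(executable
 (name main)
 (libraries sedlex)
 (preprocess
  (pps sedlex.ppx)))
```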
Requires OCaml's PPX infrastructure, which can complicate setup in environments without findlib or with custom build systems, adding a learning curve.
Unicode-aware lexing can be slower than ocamllex on ASCII-only text, which may matter for large, simple inputs.
Needs manual wrapping to interface with standard parser generators like ocamlyacc and Menhir, as shown in README examples, adding complexity.
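The wrapping step can be sketched as follows, assuming the sedlex library and PPX are available. The token type, read, and supplier_of_string are hypothetical names for illustration; in a real project the token type would come from the Menhir-generated parser module. Sedlexing.with_tokenizer turns a lexing function into a supplier of (token, start position, end position) triples.

```ocaml
(* Hypothetical token type; a Menhir grammar would normally define it. *)
type token = INT of int | PLUS | EOF

(* A small sedlex lexer producing the tokens above. *)
let rec read buf =
  match%sedlex buf with
  | white_space -> read buf
  | Plus ('0' .. '9') -> INT (int_of_string (Sedlexing.Utf8.lexeme buf))
  | '+' -> PLUS
  | eof -> EOF
  | _ -> failwith "unexpected character"

(* Build a function of type
   unit -> token * Lexing.position * Lexing.position
   that yields one token per call. *)
let supplier_of_string s =
  Sedlexing.with_tokenizer read (Sedlexing.Utf8.from_string s)
```

With a Menhir-generated entry point (say, a hypothetical Parser.main), such a supplier is typically passed to MenhirLib.Convert.Simplified.traditional2revised Parser.main, which adapts the parser to accept a token supplier instead of a Lexing.lexbuf.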