A high-performance C++ library for Unicode validation and transcoding (UTF-8/16/32, Latin1, Base64) using SIMD instructions.
simdutf is a high-performance C++ library for Unicode text processing. It provides fast validation and transcoding between UTF-8, UTF-16, UTF-32, and Latin1, plus Base64 encoding and decoding, using SIMD instructions to reach speeds of billions of characters per second. It targets performance-sensitive applications that need secure and efficient Unicode handling.
It is aimed at developers building systems that require fast and safe Unicode processing, such as web browsers, JavaScript runtimes, databases, log processors, and terminal emulators.
Developers choose simdutf for its speed (often 3-10x faster than ICU on non-ASCII text) and its robust validation, which prevents security issues from malformed Unicode. It is battle-tested in production by Node.js, WebKit, Chromium, Bun, and Cloudflare Workers.
Unicode routines (UTF8, UTF16, UTF32) and Base64: billions of characters per second using SSE2, AVX2, NEON, AVX-512, RISC-V Vector Extension, LoongArch64, POWER. Part of Node.js, WebKit/Safari, Ladybird, Chromium, Cloudflare Workers, Ghostty and Bun.
Benchmarks show it processes billions of characters per second, with Node.js reporting a 364% performance gain in UTF-8 decoding, and it's 3-10x faster than ICU on non-ASCII text.
Widely adopted by major projects like Node.js, Chromium, and Bun, with exhaustive tests and fuzzing ensuring reliability in real-world usage.
Functions like validate_utf8_with_errors provide detailed error codes and positions, allowing precise debugging of malformed inputs without allocations.
All functions are non-allocating and exception-free, minimizing overhead and making them safe for performance-critical paths, as stated in the philosophy section.
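The error-reporting style described above (an error code plus a byte position, with no allocation and no exceptions) can be illustrated with a small self-contained sketch. This is a plain scalar loop written for this page, not simdutf's SIMD implementation. The names `Utf8Error`, `Utf8Result`, and `validate_utf8_sketch` are hypothetical, and the error categories are simplified assumptions rather than the library's actual codes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical error categories for illustration only (not simdutf's enum).
enum class Utf8Error { Ok, InvalidLeadByte, MissingContinuation, Overlong, Surrogate, TooLarge };

// Hypothetical result type: an error code plus a byte offset.
struct Utf8Result {
    Utf8Error error;
    std::size_t position;  // offset of the offending sequence, or len on success
};

// Scalar validator: no heap allocation, no exceptions, caller owns the buffer.
Utf8Result validate_utf8_sketch(const char* data, std::size_t len) noexcept {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(data);
    std::size_t i = 0;
    while (i < len) {
        const unsigned char lead = p[i];
        if (lead < 0x80) { ++i; continue; }  // ASCII byte

        std::size_t extra = 0;     // number of continuation bytes expected
        std::uint32_t cp = 0;      // decoded code point
        std::uint32_t min_cp = 0;  // smallest code point allowed for this length
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return {Utf8Error::InvalidLeadByte, i};  // stray continuation or 0xF8..0xFF

        for (std::size_t k = 1; k <= extra; ++k) {
            // Input ended early, or the next byte is not of the form 10xxxxxx.
            if (i + k >= len || (p[i + k] & 0xC0) != 0x80) {
                return {Utf8Error::MissingContinuation, i};
            }
            cp = (cp << 6) | (p[i + k] & 0x3Fu);
        }
        if (cp < min_cp) return {Utf8Error::Overlong, i};
        if (cp > 0x10FFFF) return {Utf8Error::TooLarge, i};
        if (cp >= 0xD800 && cp <= 0xDFFF) return {Utf8Error::Surrogate, i};
        i += extra + 1;
    }
    return {Utf8Error::Ok, len};
}
```

A caller checks `error` first and, on failure, uses `position` to locate the bad sequence in its own buffer. Because the function only reads the caller's memory and returns a small struct by value, it fits the non-allocating, exception-free design the listing describes.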
No built-in bindings for other languages; developers must manually create interfaces if using non-C++ ecosystems, limiting accessibility for projects in Python, Java, or Rust.
Requires recent compilers and specific assemblers for features like AVX-512, with the README noting that GCC under Windows is buggy and unsupported, complicating cross-platform deployment.
Focuses solely on validation and transcoding, missing advanced operations like normalization or collation that libraries like ICU provide, which may necessitate additional dependencies.
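For the language-binding limitation noted above, one common workaround is a thin `extern "C"` wrapper that exposes a C-compatible symbol other languages can call through their foreign-function interfaces. The sketch below shows only the shape of such a shim. `demo::is_ascii` is a hypothetical stand-in for a C++ validation routine, and `demo_validate_ascii` is an invented name, not part of any library.

```cpp
#include <cstddef>

namespace demo {
// Hypothetical stand-in for a C++ validation routine; a real binding
// would forward to the library's function here instead.
inline bool is_ascii(const char* data, std::size_t len) noexcept {
    for (std::size_t i = 0; i < len; ++i) {
        if (static_cast<unsigned char>(data[i]) >= 0x80) return false;
    }
    return true;
}
}  // namespace demo

// C-compatible entry point: plain pointer and length in, int out.
// Returns 1 if valid, 0 if invalid, -1 on a null pointer with nonzero length.
extern "C" int demo_validate_ascii(const char* data, std::size_t len) {
    if (data == nullptr && len != 0) return -1;
    return demo::is_ascii(data, len) ? 1 : 0;
}
```

Keeping the interface to raw pointers, lengths, and integers avoids exposing C++ types across the boundary, which is what makes the symbol callable from C-style foreign-function interfaces.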