A Rust library providing fast, linear-time and linear-space suffix arrays with full Unicode support.
suffix is a Rust library for building and using suffix arrays, a compact data structure that enables fast substring searches and text analysis. It solves the problem of efficiently finding all occurrences of a pattern in a text, pairing linear-time construction with fast queries against the built array. The library distinguishes itself by providing full Unicode support, making it suitable for processing modern multilingual text data.
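To make the idea concrete, here is a minimal sketch of a suffix array in plain Rust, using only the standard library. It is not the crate's implementation: it sorts the suffixes naively rather than using SA-IS, and the function names `build_suffix_array`, `positions`, and `contains` are chosen here for illustration. It shows the two things the description relies on: suffixes start only on UTF-8 character boundaries, and a pattern's occurrences form one contiguous run in the sorted array, so they can be found by binary search.

```rust
/// Naive suffix array: the byte offset of every suffix of `text`,
/// sorted by the suffix it points to. Only offsets that fall on UTF-8
/// character boundaries are included, so slicing never splits a character.
/// (Illustrative only: this sort is far slower than SA-IS on large inputs.)
fn build_suffix_array(text: &str) -> Vec<usize> {
    let mut sa: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

/// All byte offsets at which `pattern` occurs in `text`, in ascending order.
fn positions(text: &str, sa: &[usize], pattern: &str) -> Vec<usize> {
    // Index of the first suffix that is not smaller than the pattern.
    let lo = sa.partition_point(|&i| &text[i..] < pattern);
    // Suffixes that start with the pattern form a contiguous run from `lo`.
    let len = sa[lo..].partition_point(|&i| text[i..].starts_with(pattern));
    let mut hits = sa[lo..lo + len].to_vec();
    hits.sort_unstable();
    hits
}

/// Existence check: one binary search, without collecting every match.
fn contains(text: &str, sa: &[usize], pattern: &str) -> bool {
    let lo = sa.partition_point(|&i| &text[i..] < pattern);
    lo < sa.len() && text[sa[lo]..].starts_with(pattern)
}
```

For the text `"the quick brown fox was quick."`, calling `positions` with `"quick"` yields the byte offsets 4 and 24. The array is built once and then reused for every query, which is where the cost of construction pays off.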
It is aimed at Rust developers working on text processing, search engines, bioinformatics, or any application requiring efficient substring matching on large text corpora.
Developers choose suffix for its combination of linear-time performance, Unicode correctness, and Rust's safety guarantees. Unlike byte-oriented alternatives, it properly handles UTF-8 text while maintaining competitive speed through optimized algorithms.
Fast suffix arrays for Rust (with Unicode support).
Uses the SA-IS algorithm for O(n) construction, making it efficient for large texts, as shown in benchmarks against naive O(n^2 log n) methods.
Properly handles UTF-8 text, unlike byte-oriented implementations, ensuring correctness for multilingual data without manual encoding work.
Once the suffix array is built, searches are fast: benchmarks show up to a 100x speedup for non-matching queries, along with rapid existence checks.
Includes the 'stree' command-line tool to generate suffix tree diagrams, aiding in debugging and educational use, as demonstrated with the banana example.
The library lacks built-in generalized suffix arrays, so developers must implement workarounds such as manual offset management to handle multiple documents, as the README acknowledges.
Building the suffix array requires significant time and memory, making it unsuitable for real-time applications or frequently changing text, despite the linear-time algorithm.
Limited to Rust projects with no cross-language bindings, which can be a barrier for teams using other programming languages or needing broader ecosystem integration.
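The manual offset management mentioned above for multiple documents can be sketched roughly as follows. This is an assumption about what such a workaround might look like, not code from the library: the `Corpus` type, its `locate` method, and the NUL separator are all hypothetical. The documents are concatenated into one string that a single suffix array could index, and each global match offset is mapped back to a document index and a local offset.

```rust
/// Separator placed after each document. Assumed never to occur inside a
/// document, so that no match can span two documents.
const SEPARATOR: char = '\u{0}';

/// Several documents concatenated into one searchable string (hypothetical type).
struct Corpus {
    text: String,
    /// Byte offset in `text` at which each document begins, in ascending order.
    starts: Vec<usize>,
}

impl Corpus {
    fn new(docs: &[&str]) -> Self {
        let mut text = String::new();
        let mut starts = Vec::with_capacity(docs.len());
        for doc in docs {
            starts.push(text.len());
            text.push_str(doc);
            text.push(SEPARATOR);
        }
        Corpus { text, starts }
    }

    /// Maps a byte offset in the concatenated text to
    /// (document index, byte offset within that document).
    fn locate(&self, global: usize) -> Option<(usize, usize)> {
        if global >= self.text.len() {
            return None;
        }
        // Count the documents that start at or before `global`.
        let n = self.starts.partition_point(|&s| s <= global);
        let doc = n.checked_sub(1)?;
        Some((doc, global - self.starts[doc]))
    }
}
```

With the documents `"banana"` and `"bandana"`, a match reported at global offset 11 maps to document 1 at local offset 4. The offsets passed to `locate` would come from whatever search runs over `Corpus::text`, for example a suffix array built on it.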