An efficient command-line tool and library for filtering duplicate lines from textual input, optimized for speed and memory usage.
Runiq is a Rust-based command-line utility and library that filters duplicate lines from textual input with high efficiency in both time and memory. It solves the problem of deduplicating large datasets by offering multiple filtering algorithms optimized for different trade-offs between speed, memory usage, and accuracy. It serves as a versatile alternative to traditional Unix tools like `uniq` and `sort -u`.
Runiq is aimed at system administrators, data engineers, and developers who process large log files, datasets, or streams and need efficient deduplication with control over performance characteristics. It also suits Rust developers looking for a programmatic deduplication library.
Developers choose Runiq for its configurable filters, which can be tuned to a specific use case: for example, the memory-efficient `compact` filter, which uses a Bloom filter, or the high-speed `sorted` filter for pre-sorted data. Its benchmarks report competitive or superior performance and lower memory usage compared to alternatives like `uniq`, `sort -u`, `uq`, and `huniq`.
An efficient way to filter duplicate lines from input, à la uniq.
Offers quick, simple, sorted, and compact filters, each optimized for specific trade-offs between speed, memory, and accuracy, as detailed in the README.
Benchmarks show competitive or superior performance with lower memory usage compared to tools like uniq and sort -u, especially with large datasets.
Can be used as a Rust library by disabling default features, allowing integration into custom applications for embedded deduplication.
Supports both sorted and unsorted data. Some filters work on arbitrary input, while the `sorted` filter requires pre-sorted input and in exchange keeps resource usage low.
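The difference between filtering unsorted and pre-sorted input can be sketched in a few lines of Rust. This is a minimal illustration of the general idea, not Runiq's implementation or API: the function names `dedup_with_set` and `dedup_sorted` are hypothetical. For unsorted input, an exact filter has to remember every distinct line it has seen, whereas for pre-sorted input duplicates are always adjacent, so remembering only the previous line is enough.

```rust
use std::collections::HashSet;

/// Exact filter for unsorted input (hypothetical helper, not Runiq's API).
/// Remembers every distinct line, so memory grows with the number of unique lines.
fn dedup_with_set<'a>(lines: &[&'a str]) -> Vec<&'a str> {
    let mut seen: HashSet<&'a str> = HashSet::new();
    let mut out = Vec::new();
    for &line in lines {
        // `insert` returns true only the first time a value is added.
        if seen.insert(line) {
            out.push(line);
        }
    }
    out
}

/// Filter for pre-sorted input (hypothetical helper, not Runiq's API).
/// Duplicates are adjacent in sorted data, so only the previous line is kept.
fn dedup_sorted<'a>(lines: &[&'a str]) -> Vec<&'a str> {
    let mut prev: Option<&'a str> = None;
    let mut out = Vec::new();
    for &line in lines {
        if prev != Some(line) {
            out.push(line);
            prev = Some(line);
        }
    }
    out
}

fn main() {
    let unsorted = ["b", "a", "b", "c", "a"];
    let sorted = ["a", "a", "b", "b", "c"];

    println!("{:?}", dedup_with_set(&unsorted)); // ["b", "a", "c"]
    println!("{:?}", dedup_sorted(&sorted)); // ["a", "b", "c"]
    // Feeding unsorted data to the sorted-input filter misses the
    // non-adjacent duplicates and returns all five lines unchanged.
    println!("{:?}", dedup_sorted(&unsorted));
}
```

The last call shows why a filter designed for sorted data needs its precondition to hold: it is cheap precisely because it never looks further back than one line.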
The `compact` filter does not guarantee exact results: in rare cases it produces a false positive, treating a line it has never seen as a duplicate and dropping it. This makes it unsuitable for applications where every unique line must be preserved.
Installation goes through Cargo and therefore requires the Rust toolchain, which may be a barrier in environments without Rust or for users unfamiliar with its ecosystem.
Benchmarks are based on specific data templates. As the README notes, real-world performance may vary with input characteristics, so the published numbers may not carry over to other workloads.
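To make the false-positive limitation of a Bloom-filter-based approach concrete, here is a minimal sketch of the technique in Rust. It is an assumption-laden illustration, not Runiq's `compact` filter: the `Bloom` type, the `dedup_compact` function, and the choice of the standard library's `DefaultHasher` with a seed are all hypothetical. A Bloom filter never forgets a line it has seen, so true duplicates are always removed, but once enough bits are set it can mistake a new line for an old one.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A tiny fixed-size Bloom filter (illustrative only, not Runiq's code).
struct Bloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl Bloom {
    fn new(num_bits: usize, hashes: u64) -> Self {
        assert!(num_bits > 0 && hashes > 0);
        Bloom { bits: vec![false; num_bits], hashes }
    }

    /// Bit position for `item` under the `seed`-th hash function.
    fn index(&self, item: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        item.hash(&mut h);
        (h.finish() % self.bits.len() as u64) as usize
    }

    /// Marks `item` as seen. Returns true if all of its bits were already
    /// set, meaning the item was *probably* seen before.
    fn check_and_insert(&mut self, item: &str) -> bool {
        let mut probably_seen = true;
        for seed in 0..self.hashes {
            let i = self.index(item, seed);
            if !self.bits[i] {
                probably_seen = false;
                self.bits[i] = true;
            }
        }
        probably_seen
    }
}

/// Keeps each line the filter has (probably) not seen before.
fn dedup_compact<'a>(lines: &[&'a str], num_bits: usize, hashes: u64) -> Vec<&'a str> {
    let mut bloom = Bloom::new(num_bits, hashes);
    lines
        .iter()
        .copied()
        .filter(|line| !bloom.check_and_insert(line))
        .collect()
}

fn main() {
    let lines = ["a", "b", "a", "c", "b", "d"];

    // A generous bit array makes false positives unlikely for this input.
    println!("{:?}", dedup_compact(&lines, 1 << 16, 3));

    // Degenerate case: with a single bit, the first line sets it and every
    // later line looks like a duplicate, so unique lines are dropped.
    println!("{:?}", dedup_compact(&lines, 1, 1)); // ["a"]
}
```

The single-bit case is deliberately extreme. In practice the bit-array size and number of hash functions are chosen to keep the false-positive rate low, which is the memory-for-accuracy trade-off described above.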
runiq is an open-source alternative to the following products:
`uniq` is a Unix command-line utility that filters adjacent matching lines from input. Because it only compares neighboring lines, it is commonly used to report or filter duplicates in sorted text files.
`sort -u` runs the `sort` utility found on Unix-like systems with the `-u` (unique) flag, which sorts lines of text and removes duplicate entries.