Native and Redis-backed Bloom filter implementations in Ruby for probabilistic set membership testing.
BloomFilter(s) in Ruby is a library that implements Bloom filters, a probabilistic data structure for testing set membership with space efficiency. It solves the problem of quickly checking whether an item is in a large set without storing the entire set, using both native C extensions and Redis-backed storage for flexibility. The library supports counting and non-counting variants, making it suitable for applications like duplicate detection, cache filtering, and flow analysis.
Ruby developers building applications that require efficient membership queries on large datasets, such as web crawlers, caching systems, or data pipelines where false positives are acceptable but memory usage must be minimized.
Developers choose this library for its multiple implementation options (native and Redis-backed), configurable performance parameters, and ease of integration into Ruby projects, offering a balance of speed, memory efficiency, and scalability without external dependencies beyond Redis for distributed use.
BloomFilter(s) in Ruby: Native counting filter + Redis counting/non-counting filters
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.
Offers native C in-memory filters and Redis-backed versions, providing flexibility for both standalone and distributed use cases, as detailed in the README examples.
Redis non-counting filter uses minimal memory, e.g., only 2.5 MB for 1 million items at a 1% false positive rate, making it scalable for large datasets.
Includes Redis counting filters with time-based expiration (TTL), enabling automatic data cleanup and deletion capabilities, as shown in the code snippets.
Allows fine-tuning of parameters like bit array size, hash functions, and seed values to optimize false positive rates and performance for specific needs.
Relies on seeding a single CRC32 hash with different values for multiple hashes, which the README admits may not provide good distribution for all data types.
The counting Redis filter requires a separate key for each counter, leading to significantly higher memory usage compared to the non-counting version.
Distributed and persistent features are locked behind Redis, adding infrastructure complexity and limiting portability for projects without it.
Users must understand Bloom filter theory to configure size, hashes, and other parameters correctly, which can be error-prone and time-consuming.