A Ruby gem for calculating text similarity using tf*idf and BM25 vector space models.
tf-idf-similarity is a Ruby gem that calculates the similarity between texts using a bag-of-words Vector Space Model with tf*idf weights. It implements the same tf*idf formula used by Lucene, Sphinx, and Ferret, providing a reliable way to compare documents based on term importance. The library also supports the Okapi BM25 ranking function and integrates with faster matrix libraries like NArray for performance.
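As a rough illustration of the bag-of-words model described above, here is a minimal pure-Ruby sketch. It is not the gem's code and does not call the gem. It assumes Lucene's classic weighting (square-root term frequency and idf = 1 + ln(N / (df + 1))), and the helper names `tokenize`, `tfidf_vectors`, and `cosine` are hypothetical.

```ruby
# Minimal sketch of tf*idf weighting with cosine similarity.
# Assumption: Lucene-style classic weights; helper names are illustrative only.

# Split text into lowercase alphanumeric tokens.
def tokenize(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Build one sparse tf*idf vector (term => weight) per input text.
def tfidf_vectors(texts)
  docs = texts.map { |t| tokenize(t).tally }
  n = docs.size.to_f

  # Document frequency: number of documents containing each term.
  df = Hash.new(0)
  docs.each { |counts| counts.each_key { |term| df[term] += 1 } }

  docs.map do |counts|
    counts.to_h do |term, count|
      [term, Math.sqrt(count) * (1 + Math.log(n / (df[term] + 1)))]
    end
  end
end

# Cosine similarity between two sparse vectors; 0.0 if either is empty.
def cosine(a, b)
  dot = a.sum { |term, w| w * b.fetch(term, 0.0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |w| w * w }) }
  denom = norm.(a) * norm.(b)
  denom.zero? ? 0.0 : dot / denom
end
```

Texts that share many highly weighted terms score close to 1, and texts with no terms in common score 0.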
The gem is aimed at Ruby developers working on information retrieval, text analysis, or natural language processing projects who need to compute document similarity. It is particularly useful for those building search engines, recommendation systems, or content analysis tools.
Developers choose tf-idf-similarity because it implements the tf*idf formula used by Lucene, Sphinx, and Ferret, which no other Ruby gem did when it was created. It also accepts custom tokenization, supports several matrix libraries for speed, and includes both tf*idf and BM25 models.
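Custom tokenization of the kind mentioned above might look like the following sketch. It uses only plain Ruby; the stop-word list and the helper names `custom_tokens` and `term_counts` are hypothetical, and how the results are handed to the gem should be checked against its README.

```ruby
# Sketch of custom tokenization with stop-word removal and manual term counts.
# Assumption: the stop-word list and method names are illustrative only.

STOP_WORDS = %w[and the to].freeze

# Lowercase the text, split on non-alphanumeric characters, drop stop words.
def custom_tokens(text, stop_words = STOP_WORDS)
  text.downcase.scan(/[[:alnum:]]+/).reject { |t| stop_words.include?(t) }
end

# Count occurrences of each token, skipping purely numeric tokens.
def term_counts(tokens)
  counts = Hash.new(0)
  tokens.each { |t| counts[t] += 1 unless t.match?(/\A\d+\z/) }
  counts
end
```

Preparing tokens or counts outside the library keeps language-specific decisions, such as which words to ignore, in the caller's hands.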
Ruby gem to calculate the similarity between texts using tf*idf
Implements the same tf*idf formula used by Lucene, Sphinx, and Ferret, so weights are consistent with those established search tools, something earlier Ruby gems did not offer.
Supports integration with faster matrix libraries like NArray, GSL, and NMatrix, with NArray noted for the best performance, allowing optimization for speed.
Allows developers to provide custom tokens, exclude stop words, and handle term counts manually, as shown in README examples with UnicodeUtils.
Offers the Okapi BM25 ranking function as an alternative similarity measure, providing a modern retrieval function alongside traditional tf-idf.
The README explicitly advises using Lucene for performance-demanding use cases, indicating inherent limitations in Ruby-based implementations for large-scale tasks.
To achieve better performance, users must install external libraries like GSL or NArray, which involve native dependencies and increase setup complexity, as noted in the troubleshooting section.
Only includes tf-idf and BM25 models, whereas Lucene offers advanced similarity functions like DFR and language models, making it less suitable for cutting-edge IR research.
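For readers unfamiliar with Okapi BM25, the sketch below scores tokenized documents against a list of query terms in plain Ruby. It is not the gem's implementation: it assumes the common defaults k1 = 1.2 and b = 0.75 and the non-negative idf variant ln(1 + (N - df + 0.5) / (df + 0.5)), and the method name `bm25_scores` is hypothetical.

```ruby
# Sketch of Okapi BM25 scoring for pre-tokenized documents.
# Assumptions: k1 = 1.2, b = 0.75, non-negative idf variant;
# at least one document is non-empty.

# Returns one score per document for the given query terms.
def bm25_scores(query_terms, docs, k1: 1.2, b: 0.75)
  n = docs.size
  avgdl = docs.sum(&:size) / n.to_f # average document length in tokens

  docs.map do |doc|
    counts = doc.tally
    query_terms.uniq.sum(0.0) do |term|
      df = docs.count { |d| d.include?(term) }
      idf = Math.log(1 + (n - df + 0.5) / (df + 0.5))
      f = counts.fetch(term, 0)
      # Term frequency saturates with k1 and is normalized by document length via b.
      idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc.size / avgdl))
    end
  end
end
```

Unlike raw tf*idf, repeated occurrences of a term add progressively less to the score, and longer documents are penalized relative to the average length.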