A Java library implementing various string similarity and distance algorithms like Levenshtein, Jaro-Winkler, and n-gram methods.
Java String Similarity is a library that implements a wide range of algorithms for calculating the similarity or distance between two strings. It solves the problem of comparing text data by providing methods like Levenshtein edit distance, Jaro-Winkler similarity, and n-gram based approaches, which are essential for applications like spell checking, record linkage, and text analysis.
It is aimed at Java developers working on text processing, data deduplication, search engines, or natural language processing tasks that require fuzzy string matching or similarity measurement.
Developers choose this library because it consolidates many well-known string comparison algorithms into a single, easy-to-use Java package with clear interfaces and documentation, eliminating the need to implement these algorithms from scratch.
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
Implements over a dozen classic algorithms like Levenshtein, Jaro-Winkler, and cosine similarity, covering most common string comparison needs without requiring external implementations.
Defines interfaces for normalized vs. metric distances, as detailed in the README's table, helping users select algorithms based on mathematical properties like triangle inequality.
Supports shingle-based methods (e.g., Q-Gram, Cosine) that allow pre-computing string profiles so that each later comparison costs O(m+n), which suits large datasets, as shown in the README's precomputed cosine example.
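The profile idea can be illustrated with a minimal, self-contained sketch. It does not use the library's API: `ShingleProfileSketch`, `profile`, and `cosine` are hypothetical names chosen for this example, and the shingle length `k` is an assumed parameter. Each string is turned into a map of shingle counts once, and the cosine similarity is then computed from the two maps alone.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the library's API.
final class ShingleProfileSketch {

    // Count every k-character shingle of s. Done once per string.
    static Map<String, Integer> profile(String s, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= s.length(); i++) {
            counts.merge(s.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity of two precomputed profiles:
    // dot product divided by the product of the Euclidean norms.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0; // a string shorter than k has no shingles
        }
        // Iterate over the smaller profile and look up in the larger one.
        Map<String, Integer> small = a.size() <= b.size() ? a : b;
        Map<String, Integer> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : small.entrySet()) {
            Integer other = large.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    private static double norm(Map<String, Integer> p) {
        double sum = 0.0;
        for (int c : p.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }
}
```

When one query is compared against many stored strings, the stored profiles can be built once and reused, so only the map lookups are repeated per comparison.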
Offers weighted Levenshtein for OCR/keyboard correction and optimal string alignment for restricted edit distance, addressing niche use cases directly mentioned in the documentation.
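To make the weighted variant concrete, here is a minimal sketch of an edit distance whose substitution cost depends on the character pair. It is written under assumptions rather than taken from the library: `SubstitutionCost` and `WeightedLevenshteinSketch` are hypothetical names, and insertions and deletions are assumed to cost 1.0 each.

```java
// Hypothetical names for illustration; the library's own classes may differ.
interface SubstitutionCost {
    double cost(char from, char to);
}

final class WeightedLevenshteinSketch {

    // Edit distance where replacing one character by another costs
    // sub.cost(from, to); insertions and deletions cost 1.0 (an assumption).
    static double distance(String a, String b, SubstitutionCost sub) {
        double[] prev = new double[b.length() + 1];
        double[] curr = new double[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                char ca = a.charAt(i - 1);
                char cb = b.charAt(j - 1);
                double substitute = prev[j - 1] + (ca == cb ? 0.0 : sub.cost(ca, cb));
                double delete = prev[j] + 1.0;
                double insert = curr[j - 1] + 1.0;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            double[] tmp = prev; // reuse the two rows instead of a full table
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For OCR or keyboard correction, the cost function would return a low value for pairs that are easily confused, for example `(x, y) -> (x == 't' && y == 'r') ? 0.5 : 1.0`, so that likely typos count as smaller edits.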
Lacks contemporary techniques like neural embeddings or transformer-based similarity, limiting its relevance for advanced NLP tasks that go beyond traditional algorithms.
Core algorithms like Levenshtein use O(m*n) dynamic programming, which the README admits can be slow for long strings, with no implementation of faster methods like Four Russians optimization.
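The quadratic cost comes from filling one table cell per pair of character positions. The following is a minimal sketch of the textbook dynamic program, not the library's implementation; `LevenshteinSketch` is a hypothetical name. It keeps only two rows, which reduces memory to O(n) but leaves the running time at O(m*n).

```java
// Hypothetical class for illustration; not the library's implementation.
final class LevenshteinSketch {

    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        // The nested loops visit every (i, j) pair once: m * n cell updates.
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(prev[j] + 1, curr[j - 1] + 1), // delete, insert
                        prev[j - 1] + cost);                    // match or substitute
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Two 10,000-character strings already require on the order of 100 million cell updates, which is why long inputs become slow.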
Includes SIFT4 only as an experimental algorithm, with no guarantees on API stability or optimization, so users who rely on it may face breaking changes in future releases.