A Python library for computing string similarity metrics including Levenshtein, Hamming, Jaccard, and Sorensen distances.
Distance is a Python library for computing similarity metrics between sequences, such as strings, tuples, or lists. It implements algorithms like Levenshtein and Hamming distances to measure how different two sequences are, which is useful for tasks like spell-checking, data deduplication, and natural language processing.
It is aimed at Python developers working on text processing, data cleaning, or NLP applications, and at anyone who needs to compare sequences for similarity with good performance.
It provides a comprehensive set of distance metrics in one package, with both pure Python and optional C extensions for speed, making it versatile for both prototyping and production use.
Levenshtein and Hamming distance computation
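To make the two core metrics concrete, here is a minimal pure-Python sketch of Hamming and Levenshtein distance. It is an illustration written for this entry, not the library's own code or API. The function names `hamming` and `levenshtein` are chosen for readability, and the sketch assumes only that the inputs are sequences of comparable items (strings, tuples, or lists).

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty sequence costs i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


# Works on strings and on lists of words alike.
print(hamming("hamming", "hamning"))
print(levenshtein("kitten", "sitting"))
print(levenshtein(["the", "quick", "fox"], ["the", "lazy", "fox", "jumps"]))
```

Because the functions only index, iterate, and compare items for equality, the same code handles characters, syllables, or whole words without changes.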
Optional C implementations for key functions like Levenshtein and lcsubstrings provide significant speed boosts, crucial for production-scale data processing, as highlighted in the installation and usage sections.
Supports comparison of arbitrary sequences including tuples and lists, enabling use cases from phoneme analysis to sentence similarity, as demonstrated in the examples with syllables and word lists.
Offers normalized versions of Hamming and Levenshtein distances, plus inherently normalized Jaccard and Sorensen coefficients, allowing for meaningful cross-metric comparisons, as shown in the usage examples.
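As a rough illustration of why normalized scores are comparable across metrics, the sketch below computes a length-normalized Hamming distance alongside Jaccard and Sorensen distances, all of which fall in the range 0 to 1. These are standalone helper functions written for this entry under the usual set-based definitions. Their names and their handling of empty inputs are assumptions, not the library's API.

```python
def normalized_hamming(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    if not a:
        return 0.0  # assumption: two empty sequences are treated as identical
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the sets of items in each sequence."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0  # assumption: two empty sequences are treated as identical
    return 1 - len(sa & sb) / len(sa | sb)


def sorensen_distance(a, b):
    """1 - 2|A & B| / (|A| + |B|) over the sets of items in each sequence."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0  # assumption: two empty sequences are treated as identical
    return 1 - 2 * len(sa & sb) / (len(sa) + len(sb))


print(normalized_hamming("hamming", "hamning"))
print(jaccard_distance("decide", "resize"))
print(sorensen_distance("decide", "resize"))
```

Note that the two set-based coefficients ignore item order and repetition, so they answer a different question than the edit-based metrics even though they share the same 0 to 1 scale.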
Includes iterators like ifast_comp that can handle millions of tokens efficiently, making it suitable for filtering large datasets by similarity, with examples provided in the README.
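The idea of lazily filtering a large stream of candidates by similarity can be sketched with a plain generator. This is not how `ifast_comp` is implemented. The names `bounded_levenshtein` and `iter_within` are hypothetical, and the sketch simply pairs a standard edit-distance computation with an early exit once the threshold can no longer be met.

```python
def bounded_levenshtein(a, b, max_dist):
    """Return the edit distance, or -1 once it is certain to exceed max_dist."""
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > max_dist:
        return -1
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or free if equal
            ))
        # The final distance can never be smaller than the minimum of a row.
        if min(current) > max_dist:
            return -1
        previous = current
    return previous[-1] if previous[-1] <= max_dist else -1


def iter_within(reference, candidates, max_dist=2):
    """Lazily yield (distance, candidate) for candidates within max_dist."""
    for candidate in candidates:
        dist = bounded_levenshtein(reference, candidate, max_dist)
        if dist != -1:
            yield dist, candidate


words = ["fo", "bar", "foob", "foo", "fooba", "foobar"]
print(sorted(iter_within("foo", words, max_dist=2)))
```

Because `iter_within` is a generator, `candidates` can itself be a generator or a file iterator, so the full dataset never needs to be held in memory.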
The library was last updated in 2013, so it may have compatibility issues with newer Python versions and lacks modern features and optimizations, as is evident from the changelog.
Installing the C extensions requires a compiler and a specific install option such as '--with-c', which can be a barrier in environments without build tools, such as serverless platforms or minimal containers.
While it covers core metrics, it misses advanced methods like cosine similarity or weighted edit distances, limiting its applicability for some NLP or custom similarity tasks.
Documentation is minimal, relying on inline help via help(funcname), which may not suffice for complex use cases or troubleshooting beyond basic examples.