A Python library for computing distances between sequences, offering 30+ algorithms, a pure Python implementation, and optional external libraries for speed.
TextDistance is a Python library that computes distances and similarities between sequences using over 30 different algorithms. It solves the problem of comparing strings, tokens, or sequences for applications like fuzzy string matching, data cleaning, and text analysis by providing a unified interface for various distance metrics.
Python developers, researchers, data scientists, and software engineers working on text processing, data deduplication, natural language processing, or any other application that requires sequence comparison.
Developers choose TextDistance for its extensive algorithm coverage, pure Python implementation for portability, optional external library integration for speed, and a consistent interface that, unlike many alternatives, supports multi-sequence comparison.
📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
Includes over 30 algorithms across categories like edit-based, token-based, and compression-based, as detailed in the README's comprehensive tables, making it a one-stop shop for diverse comparison needs.
All implementations are in pure Python, ensuring cross-platform compatibility and easy deployment without any required external dependencies, as emphasized by the 'Pure python implementation' feature.
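As a sketch of what a dependency-free edit-distance implementation involves (illustrative only, not TextDistance's actual source), a two-row dynamic-programming Levenshtein fits in a few lines of standard Python:

```python
# Illustrative only: a minimal pure-Python Levenshtein distance, showing
# the kind of dependency-free implementation the library ships.
def levenshtein(s: str, t: str) -> int:
    """Edit distance via dynamic programming, O(len(s) * len(t))."""
    if len(s) < len(t):
        s, t = t, s  # keep the rolling row as short as possible
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("test", "text"))  # → 1
```

The tight nested loop in interpreted Python is also exactly where the speed gap against optimized C backends comes from.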
Supports optional integration with faster external libraries like jellyfish and rapidfuzz via extras installation, with benchmarks showing speed improvements of up to 100x for algorithms like Levenshtein.
Provides consistent methods such as distance() and similarity() for all algorithms and, unlike many alternatives, supports comparing more than two sequences at once.
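To make the common-interface claim concrete, here is a hypothetical sketch (not TextDistance's actual source) of how a single base class can give every metric the same distance()/similarity()/normalized_distance() surface while accepting any number of sequences. The multi-sequence Hamming semantics below (count positions where not all sequences agree, padding shorter inputs) are an illustrative choice and may differ from the library's:

```python
# Hypothetical sketch of a unified metric interface; not TextDistance's code.
class Base:
    def __call__(self, *seqs):            # subclasses define the metric here
        raise NotImplementedError

    def distance(self, *seqs):
        return self(*seqs)

    def maximum(self, *seqs):             # worst possible distance for seqs
        return max(map(len, seqs), default=0)

    def similarity(self, *seqs):
        return self.maximum(*seqs) - self.distance(*seqs)

    def normalized_distance(self, *seqs):
        m = self.maximum(*seqs)
        return self.distance(*seqs) / m if m else 0.0

class Hamming(Base):
    def __call__(self, *seqs):
        # Count positions where the sequences do not all agree; shorter
        # sequences are padded so length differences count as mismatches.
        longest = max(map(len, seqs), default=0)
        padded = [tuple(s) + (None,) * (longest - len(s)) for s in seqs]
        return sum(1 for chars in zip(*padded) if len(set(chars)) > 1)

h = Hamming()
print(h.distance("test", "text"))          # → 1
print(h.similarity("test", "text"))        # → 3
print(h.distance("test", "text", "tent"))  # → 1
```

The point of the design is that new metrics only implement `__call__`; the derived methods and the multi-sequence signature come for free.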
The pure Python core is significantly slower, with benchmarks showing algorithms like Levenshtein running 500x slower than optimized C libraries when external dependencies aren't used, limiting use in performance-sensitive scenarios.
Achieving optimal speed requires installing extras like 'textdistance[extras]', which adds deployment complexity and potential version conflicts, as noted in the installation instructions.
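The two installation modes look like this (the extras name is taken from the instructions above; jellyfish and rapidfuzz are examples of the accelerated backends):

```shell
# Pure Python only, no compiled dependencies:
pip install textdistance

# With optional accelerated backends such as jellyfish and rapidfuzz:
pip install "textdistance[extras]"
```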
Only a subset of algorithms (e.g., DamerauLevenshtein, Hamming) have external library support; others, like compression-based methods, rely solely on slower pure Python implementations.
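As an illustration of what a compression-based metric computes, here is a minimal normalized compression distance (NCD) over zlib, the idea behind metrics like TextDistance's ZLIBNCD (illustrative only, not the library's source):

```python
import zlib

# Normalized compression distance: similar inputs compress well together,
# so their concatenation adds little beyond the larger input alone.
def ncd(x: bytes, y: bytes) -> float:
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog" * 4
b = b"the quick brown fox jumps over the lazy cat" * 4
c = b"completely unrelated bytes: 0123456789 xyz!" * 4
print(ncd(a, b) < ncd(a, c))  # similar texts score lower
```

Because the heavy lifting is done by the compressor, there is no obvious C fast path to delegate to, which is why such metrics stay on the pure Python side.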