A Python library that fixes mojibake and other Unicode text glitches by detecting and correcting encoding mix-ups.
ftfy is a Python library that automatically detects and fixes common Unicode text corruption issues, such as mojibake (garbled text caused by encoding mismatches). It intelligently recovers the original intended characters from text that has been incorrectly decoded, making it essential for cleaning messy text data from sources like web scraping, legacy systems, or user inputs.
It is aimed at developers and data scientists who work with text from multiple sources, especially multilingual content, web scraping, legacy data migration, or natural language processing (NLP) pipelines, where encoding errors are common.
ftfy stands out by using sophisticated heuristics to identify and correct encoding mix-ups without altering valid text, avoiding false positives. It handles complex real-world scenarios like multi-layer mojibake and HTML entities, providing a reliable, automated solution for a problem that is often tedious to fix manually.
Fixes mojibake and other glitches in Unicode text, after the fact.
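A minimal usage sketch follows. `ftfy.fix_text` is the library's main entry point; the `clean` wrapper and its fallback to returning the text unchanged when ftfy is not installed are assumptions added here for illustration, not part of the library.

```python
try:
    import ftfy  # third-party: install it separately
except ImportError:  # keep the sketch importable without the dependency
    ftfy = None


def clean(text: str) -> str:
    """Repair text with ftfy when it is available; otherwise return it as-is.

    `clean` is a hypothetical helper name, not an ftfy function.
    """
    if ftfy is None:
        return text
    return ftfy.fix_text(text)


if __name__ == "__main__":
    # Plain, correctly decoded text should come back unchanged.
    print(clean("plain ASCII text"))
```

Wrapping the call in one place makes it easy to apply the same cleaning step at every ingestion point of a pipeline.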
Uses heuristics to identify and correct UTF-8 text that was incorrectly decoded as another encoding, turning examples like 'âœ” No problems' into '✔ No problems' without altering valid text, as shown in the README.
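The round trip behind this kind of corruption can be sketched with the standard library alone. This is not ftfy's implementation: it assumes the wrong codec was Windows-1252 and applies no heuristics, and `garble` and `ungarble` are hypothetical helper names.

```python
def garble(text: str, wrong_encoding: str = "windows-1252") -> str:
    """Simulate the bug: UTF-8 bytes read back with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def ungarble(text: str, wrong_encoding: str = "windows-1252") -> str:
    """Reverse one layer of the bug, or return the input if that is impossible."""
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "\u2714 No problems"   # starts with a check mark
broken = garble(original)         # the check mark becomes three Latin-looking characters
repaired = ungarble(broken)
print(repr(broken), "->", repr(repaired))
```

Because a mis-decoded UTF-8 sequence still carries all of its original bytes, re-encoding with the wrong codec and decoding as UTF-8 recovers the text exactly.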
Can handle text that has been through several rounds of encoding errors, such as double-encoded mojibake; the README demonstrates this by recovering "The Mona Lisa doesn't have eyebrows." from a version whose apostrophe had been garbled multiple times.
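Layered corruption can be illustrated by repeating the wrong decode several times and then peeling the layers off one at a time. The sketch below uses only the standard library, assumes every layer used Windows-1252, and is greedy in a way that ftfy deliberately is not; `garble` and `peel_layers` are hypothetical names.

```python
def garble(text: str, wrong_encoding: str = "windows-1252") -> str:
    """Add one layer of mojibake."""
    return text.encode("utf-8").decode(wrong_encoding)


def peel_layers(text: str, wrong_encoding: str = "windows-1252",
                max_layers: int = 5) -> tuple[str, int]:
    """Undo mojibake layers until the round trip fails or stops changing the text."""
    layers = 0
    while layers < max_layers:
        try:
            candidate = text.encode(wrong_encoding).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the text no longer looks like mis-decoded UTF-8
        if candidate == text:
            break  # pure ASCII round-trips to itself; nothing left to undo
        text = candidate
        layers += 1
    return text, layers


original = "The Mona Lisa doesn\u2019t have eyebrows."
broken = garble(garble(garble(original)))   # three layers of damage
repaired, layers_removed = peel_layers(broken)
print(layers_removed, repr(repaired))
```

The loop stops on its own here because the curly apostrophe, once restored, no longer forms valid UTF-8 when re-encoded as Windows-1252.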
Decodes HTML entities even with incorrect capitalization, useful for text outside HTML contexts, like converting 'P&EACUTE;REZ' to 'PÉREZ', as highlighted in the examples.
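One way to tolerate miscapitalized entity names is to retry the lookup with different casings. The sketch below is an assumption about how such a fallback could work, built on the standard library's `html.entities.html5` table; it is not ftfy's code, and `decode_entities_loosely` is a hypothetical name.

```python
import re
from html.entities import html5  # maps names such as "Eacute;" to characters

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def _lookup(match: re.Match) -> str:
    name = match.group(1)
    # Try the name as written, then "Eacute"-style, then all lowercase.
    for candidate in (name, name.capitalize(), name.lower()):
        value = html5.get(candidate + ";")
        if value is not None:
            return value
    return match.group(0)  # unknown entity: leave it untouched


def decode_entities_loosely(text: str) -> str:
    """Decode named HTML entities, tolerating wrong capitalization."""
    return ENTITY_RE.sub(_lookup, text)


print(decode_entities_loosely("P&EACUTE;REZ"))
```

Trying the capitalized form before the lowercase one keeps an all-caps word in upper case, which is why the example yields an uppercase accented letter.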
Designed to never change correctly decoded text, ensuring reliability in automated pipelines, as emphasized in the philosophy and examples such as leaving 'IL Y MARQUÉ…' unchanged.
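The need for this caution shows up as soon as a blind round trip is applied to valid text. In the sketch below, `naive_fix` damages the already-correct example, while `cautious_fix` first checks for a telltale pattern. The pattern is an assumed stand-in that is far cruder than ftfy's real heuristics, and both function names are hypothetical.

```python
import re

# Assumed heuristic: a character that commonly starts mis-decoded UTF-8
# (U+00C2, U+00C3 or U+00E2) immediately followed by another non-ASCII character.
SUSPICIOUS = re.compile("[\u00c2\u00c3\u00e2][^\x00-\x7f]")


def naive_fix(text: str) -> str:
    """Blindly re-encode as Windows-1252 and decode as UTF-8."""
    try:
        return text.encode("windows-1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def cautious_fix(text: str) -> str:
    """Only attempt the round trip when the text looks like mojibake."""
    if not SUSPICIOUS.search(text):
        return text
    return naive_fix(text)


valid = "IL Y MARQU\u00c9\u2026"   # correct text ending in an accented E and an ellipsis
print(repr(naive_fix(valid)))      # the last two characters are merged into one
print(repr(cautious_fix(valid)))   # left alone

broken = "\u2714 No problems".encode("utf-8").decode("windows-1252")
print(repr(cautious_fix(broken)))  # genuine mojibake is still repaired
```

A guard of this kind trades some missed repairs for safety, which matches the stated priority of never changing correctly decoded text.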
Primarily targets mojibake and HTML entities; it does not address other text issues like spelling errors, formatting problems, or non-encoding-related corruptions, which might require additional tools.
The heuristic analysis adds computational cost, which can be significant for large datasets or high-frequency processing, making it less suitable for real-time or high-throughput applications.
Requires Python 3, which could be a barrier for legacy systems or projects still on Python 2; the README suggests installing with pip3 on machines where both versions are present.