A Python library that fixes mojibake and other Unicode text glitches by detecting and correcting encoding mix-ups.
ftfy is a Python library that fixes mojibake and other Unicode text glitches by detecting and correcting encoding mix-ups. It solves the problem of corrupted text data, such as when UTF-8 is misdecoded as another encoding, restoring text to its intended form without altering valid content.
Developers and data scientists working with messy text data from sources like web scraping, legacy systems, or user inputs, particularly in natural language processing (NLP) and data cleaning pipelines.
ftfy offers a reliable, heuristic-based approach to fixing text corruption that avoids false positives, making it a trusted tool for preprocessing text where accuracy is critical. Its ability to handle complex, multi-layer encoding errors sets it apart from basic encoding converters.
Fixes mojibake and other glitches in Unicode text, after the fact.
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.
Leverages UTF-8's design to reliably detect and correct mojibake without altering valid text, as shown in examples like fixing '✔' to '✔' while leaving 'IL Y MARQUÉ…' unchanged.
Handles text that has been misencoded multiple times, demonstrated by fixing complex strings like 'The Mona Lisa doesn’t have eyebrows.' to the intended form.
Decodes HTML entities even outside HTML and with incorrect capitalization, such as converting 'PÉREZ' to 'PÉREZ', which standard decoders might miss.
Prioritizes accuracy by never changing correctly-decoded text, ensuring safety in data cleaning pipelines, as highlighted in the README's philosophy section.
Optimized for common mojibake patterns (e.g., UTF-8 misdecoded as Windows-1252), but may not fix issues with rare or non-Unicode encodings, as admitted in the documentation on 'bad encodings'.
Heuristic analysis can be slower for large datasets compared to simple encoding converters, making it less suitable for high-throughput real-time processing.
While easy to install, tuning ftfy for specific or novel corruption scenarios requires deep diving into documentation and heuristics, which may not be straightforward for all users.