Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.


ftfy

NOASSERTION · Python · v6.3.1

A Python library that fixes mojibake and other Unicode text glitches by detecting and correcting encoding mix-ups.


What is ftfy?

ftfy is a Python library that fixes mojibake and other Unicode text glitches by detecting and correcting encoding mix-ups. It solves the problem of corrupted text data, such as when UTF-8 is misdecoded as another encoding, restoring text to its intended form without altering valid content.
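
The mix-up described above can be reproduced with nothing but the standard library. The sketch below is not ftfy's code; it assumes the most common case, UTF-8 bytes misread as Windows-1252, and shows that the damage is reversible because no bytes were lost along the way.

```python
# How mojibake arises: UTF-8 bytes are decoded with the wrong codec.
original = "\u2714 No problems"  # "✔ No problems"
broken = original.encode("utf-8").decode("cp1252")
print(broken)  # âœ” No problems

# Reversing the mistake: re-encode with the wrong codec, then decode as UTF-8.
repaired = broken.encode("cp1252").decode("utf-8")
print(repaired)  # ✔ No problems
```

The hard part, which ftfy's fix_text function automates, is deciding whether a string is mojibake at all and which pair of codecs produced it.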

Target Audience

Developers and data scientists working with messy text data from sources like web scraping, legacy systems, or user inputs, particularly in natural language processing (NLP) and data cleaning pipelines.

Value Proposition

ftfy takes a heuristic approach to fixing text corruption that is designed to avoid false positives, which suits preprocessing pipelines where accuracy is critical. Its ability to handle multi-layer encoding errors sets it apart from basic encoding converters.

Overview

Fixes mojibake and other glitches in Unicode text, after the fact.

Use Cases

Best For

  • Cleaning text data from web scraping where encoding issues are common
  • Fixing mojibake in legacy system exports or databases
  • Preprocessing text for NLP models to ensure consistent encoding
  • Decoding HTML entities in non-HTML contexts
  • Handling user-generated content with mixed or unknown encodings
  • Restoring text that has been through multiple incorrect encoding conversions

Not Ideal For

  • Real-time applications processing high volumes of text where speed is prioritized over accuracy
  • Systems with text using proprietary or obscure encodings not covered by ftfy's heuristics
  • Projects where text errors are semantic (e.g., spelling mistakes) rather than encoding-based
  • Environments that need aggressive text normalization and can tolerate false positives

Pros & Cons

Pros

Accurate Mojibake Detection

Leverages UTF-8's design to reliably detect and correct mojibake without altering valid text, as shown in examples like fixing 'âœ”' to '✔' while leaving 'IL Y MARQUÉ…' unchanged.

Multi-Layer Error Correction

Handles text that has been misencoded multiple times, demonstrated by restoring a string like 'The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.' to its intended form.
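
To make the idea of layered damage concrete, here is a minimal standard-library sketch. The helpers add_layer and undo_layers are hypothetical names introduced for illustration, and the loop assumes every layer is UTF-8 misread as Windows-1252; ftfy's own heuristics are more careful than this.

```python
def add_layer(text):
    """Simulate one round of damage: UTF-8 bytes misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def undo_layers(text, max_layers=5):
    """Naively peel off layers while the text still decodes as misread UTF-8."""
    for _ in range(max_layers):
        try:
            candidate = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the text cannot be reversed any further
        if candidate == text:
            break  # pure ASCII round-trips unchanged: nothing left to undo
        text = candidate
    return text


clean = "The Mona Lisa doesn\u2019t have eyebrows."
mangled = add_layer(add_layer(add_layer(clean)))  # three rounds of damage
print(mangled)
print(undo_layers(mangled))
```

A loop like this has no safeguard against rewriting text that was already correct, which is the gap ftfy's heuristics are meant to close.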

HTML Entity Flexibility

Decodes HTML entities even outside HTML and with incorrect capitalization, such as converting 'P&EACUTE;REZ' to 'PÉREZ', which standard decoders might miss.
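
The capitalization case can be sketched with the standard library's html module. The function unescape_loose below is a hypothetical helper, not ftfy's implementation; it assumes that an all-uppercase entity the standard decoder does not recognise should decode to the uppercase form of its lowercase counterpart.

```python
import html
import re

# Entities written entirely in capitals, such as "&EACUTE;".
_UPPER_ENTITY = re.compile(r"&[A-Z][A-Z0-9]*;")


def _fix_upper(match):
    entity = match.group(0)
    if html.unescape(entity) != entity:
        return entity  # a standard name (e.g. "&AMP;"); the final pass handles it
    decoded = html.unescape(entity.lower())
    if decoded == entity.lower():
        return entity  # unknown even in lowercase: leave it untouched
    return decoded.upper()


def unescape_loose(text):
    """Decode HTML entities, tolerating all-uppercase entity names."""
    return html.unescape(_UPPER_ENTITY.sub(_fix_upper, text))


print(unescape_loose("P&EACUTE;REZ"))
print(unescape_loose("P&Eacute;REZ"))
```

Unknown entities are left as they are, so text that merely contains an ampersand and a semicolon is not altered.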

False Positive Avoidance

Prioritizes accuracy by never changing correctly-decoded text, ensuring safety in data cleaning pipelines, as highlighted in the README's philosophy section.
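
A short standard-library sketch shows why this caution matters. It assumes a naive fixer that always re-encodes as Windows-1252 and decodes as UTF-8; that is an illustration of the failure mode, not something ftfy does.

```python
text = "IL Y MARQU\u00c9\u2026"  # "IL Y MARQUÉ…", already correct

# The bytes for "É…" in Windows-1252 happen to form a valid UTF-8 sequence,
# so a blind "fix" silently turns two correct characters into one wrong one.
naive = text.encode("cp1252").decode("utf-8")
print(naive)  # IL Y MARQUɅ
```

Because the original text is already sensible, the right answer is to leave it alone, which requires judging plausibility rather than just checking that a decode succeeds.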

Cons

Limited Encoding Scope

Optimized for common mojibake patterns (e.g., UTF-8 misdecoded as Windows-1252), but may not fix issues with rare or non-Unicode encodings, as admitted in the documentation on 'bad encodings'.

Performance Overhead

Heuristic analysis can be slower for large datasets compared to simple encoding converters, making it less suitable for high-throughput real-time processing.

Configuration Complexity for Edge Cases

While easy to install, tuning ftfy for specific or novel corruption scenarios requires deep diving into documentation and heuristics, which may not be straightforward for all users.
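
For the simpler tuning cases, ftfy 6.x exposes its options through TextFixerConfig, which can be passed to fix_text. The sketch below assumes ftfy is installed (it is a third-party package) and uses a hypothetical wrapper, fix_keep_quotes, that repairs encoding problems while leaving curly quotes and HTML entities alone.

```python
try:
    from ftfy import TextFixerConfig, fix_text
except ImportError:  # ftfy is third-party: pip install ftfy
    TextFixerConfig = fix_text = None


def fix_keep_quotes(text):
    """Fix mojibake but keep curly quotes and HTML entities as written."""
    if fix_text is None:
        raise RuntimeError("ftfy is not installed")
    # Turn off two of the default fixers; the others keep their defaults.
    config = TextFixerConfig(uncurl_quotes=False, unescape_html=False)
    return fix_text(text, config)
```

Calling fix_keep_quotes("a &amp; b") should return the input unchanged, while mojibake such as "âœ” No problems" is still repaired.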

Quick Stats

Stars: 4,043
Forks: 126
Contributors: 0
Open Issues: 16
Last commit: 1 year ago
Created: 2012

Tags

#data-cleaning #unicode #python-library #text-processing #character-encoding

Built With

Python

Included in

Python · 290.8k

Related Projects

python-phonenumbers

Python port of Google's libphonenumber

Stars: 3,747
Forks: 442
Last commit: 3 days ago

textdistance

📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

Stars: 3,534
Forks: 258
Last commit: 1 year ago

chardet

Python character encoding detector

Stars: 2,633
Forks: 300
Last commit: 1 month ago

shortuuid

A generator library for concise, unambiguous and URL-safe UUIDs.

Stars: 2,188
Forks: 115
Last commit: 6 months ago