A Python library that automatically detects the character encoding of text files and byte streams with high accuracy and speed.
chardet handles text of unknown encoding by analyzing byte patterns to decide whether the data is UTF-8, Windows-1252, EUC-JP, or another of its 99 supported encodings. Knowing the encoding before decoding prevents garbled text and decoding errors in applications that process text from diverse sources.
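As a minimal sketch of that workflow, the snippet below guesses the encoding of a byte string with `chardet.detect` and then decodes it. The helper name `decode_unknown`, the `fallback` parameter, and the optional-import guard are assumptions made for illustration, not part of the library.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:  # keeps the sketch runnable without the dependency
    chardet = None


def decode_unknown(data: bytes, fallback: str = "utf-8") -> tuple[str, dict]:
    """Hypothetical helper: decode bytes of unknown encoding.

    Returns the decoded text together with the raw detection result.
    """
    if chardet is not None:
        guess = chardet.detect(data)  # dict with an "encoding" key, among others
    else:
        guess = {"encoding": None, "confidence": 0.0, "language": None}

    # detect() may report no encoding (for example, on empty input).
    encoding = guess.get("encoding") or fallback
    try:
        text = data.decode(encoding, errors="replace")
    except LookupError:
        # The reported name is unknown to Python's codec registry.
        text = data.decode(fallback, errors="replace")
    return text, guess


raw = "Grüße aus Köln, señor façade".encode("cp1252")
text, guess = decode_unknown(raw)
print(guess.get("encoding"), guess.get("confidence"), guess.get("language"))
```

Short inputs give a detector little evidence to work with, so it is worth checking the reported confidence before trusting the guess.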
It is aimed at Python developers working on text processing, data ingestion pipelines, web scraping, internationalization, or any application that handles files in unknown or mixed encodings.
Developers choose chardet for its 99.3% accuracy, its 47x speedup over chardet 6.0.0, its language detection, and a backward-compatible API that makes it a drop-in replacement while adding modern features such as MIME type detection and encoding filters.
Python character encoding detector
Achieves 99.3% accuracy on 2,517 test files, significantly outperforming chardet 6.0.0 (88.2%) and charset-normalizer (85.4%), making it highly reliable for diverse text sources.
With mypyc compilation, it's 47x faster than chardet 6.0.0 and 1.5x faster than charset-normalizer, handling 551 files per second for efficient bulk processing.
Provides language detection with 95.7% accuracy across 49 languages, returned with every encoding result, aiding internationalization workflows without extra tools.
Detects 40+ binary file formats via magic numbers and supports incremental detection of large files or network streams through UniversalDetector.
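Incremental detection can be sketched as follows: chunks are fed to a `UniversalDetector` until it reports that it is done, so a large file or network stream need not be read in full. The function name `detect_stream`, the chunk size, and the optional-import guard are assumptions for illustration.

```python
import io

try:
    from chardet import UniversalDetector  # third-party: pip install chardet
except ImportError:  # lets the module load without the dependency
    UniversalDetector = None


def detect_stream(stream, chunk_size: int = 4096) -> dict:
    """Hypothetical helper: detect the encoding of a binary stream in chunks."""
    if UniversalDetector is None:
        raise RuntimeError("chardet is not installed")

    detector = UniversalDetector()
    # read() returns b"" at end of stream, which stops the iteration.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        detector.feed(chunk)
        if detector.done:  # the detector is confident enough; stop reading
            break
    detector.close()  # finalizes the result
    return detector.result


if UniversalDetector is not None:
    sample = io.BytesIO(("héllo wörld " * 200).encode("utf-8"))
    print(detect_stream(sample, chunk_size=64))
```

The same loop works for any object with a binary `read` method, such as a file opened in `"rb"` mode.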
Peak memory usage is 52.9 MiB, nearly double that of chardet 6.0.0, which could be a problem for memory-constrained applications or embedded systems.
Best performance requires mypyc compilation (a 1.67x speedup), which adds a build step that may complicate deployment or be infeasible in some environments.
Supports only Python 3.10 and above, so legacy projects on older Python versions cannot use it without upgrading.