How do I detect encoding of a large file without loading it all into memory?

Use the UniversalDetector class to feed data incrementally—it supports streaming detection, allowing you to process files or network streams piece by piece until confidence is reached, ideal for big data or real-time inputs.

chardet vs charset-normalizer: which is better for my project?

chardet 7 offers higher accuracy (99.3% vs 85.4%) and is faster when compiled with mypyc, plus extra features like language and MIME type detection. However, charset-normalizer is MIT licensed and may suffice for simpler use cases without compilation.

Is chardet thread-safe for multi-threaded applications?

Yes, the detect() and detect_all() functions are thread-safe and scale on free-threaded Python, making it suitable for concurrent calls in web servers or parallel processing pipelines.

How can I restrict chardet to only check for UTF-8 and Windows-1252?

Use the include_encodings parameter (e.g., include_encodings=["utf-8", "windows-1252"]) to limit detection to specific encodings, reducing false positives and improving performance in controlled environments.

Does chardet work with binary files like images or PDFs?

Yes, it includes MIME type detection that identifies 40+ binary formats via magic number signatures, helping distinguish between text files and non-text files like images, audio, or documents.

What's the accuracy of chardet's language detection feature?

Language detection has 95.7% accuracy across 49 languages, provided with each encoding result—useful for applications needing both encoding and language info, such as content analysis or localization.

Open-Awesome

chardet

0BSDPython7.4.3

A Python library that automatically detects the character encoding of text files and byte streams with high accuracy and speed.

GitHub

2.6k stars300 forks0 contributors

What is chardet?

chardet is a Python library that automatically detects the character encoding of text files and byte streams. It solves the problem of handling unknown text encodings by analyzing byte patterns to determine whether data is UTF-8, Windows-1252, EUC-JP, or any of 99 supported encodings, preventing garbled text and decoding errors in applications that process diverse text sources.

Target Audience

Python developers working with text processing, data ingestion pipelines, web scraping, internationalization, or any application that handles files from unknown or multiple encoding sources.

Value Proposition

Developers choose chardet for its exceptional 99.3% accuracy, 47x speed improvement over previous versions, language detection capabilities, and backward-compatible API that makes it a drop-in replacement while offering modern features like MIME type detection and encoding filters.

Overview

Python character encoding detector

Use Cases

Best For

Processing text files with unknown character encodings in data pipelines
Web scraping applications that encounter pages with various encodings
Internationalization workflows requiring language detection alongside encoding detection
Legacy system integration where file encodings are undocumented or inconsistent
Building tools that need to distinguish between text files and binary formats
Handling user-uploaded files with unpredictable encoding in web applications

Not Ideal For

Applications requiring real-time encoding detection with sub-millisecond latency
Projects running on memory-constrained systems where 52.9 MiB peak usage is prohibitive
Environments stuck on Python versions older than 3.10
Situations where the exact encoding is always known and detection overhead is unnecessary

Pros & Cons

Pros

Exceptional Accuracy

Achieves 99.3% accuracy on 2,517 test files, significantly outperforming chardet 6.0.0 (88.2%) and charset-normalizer (85.4%), making it highly reliable for diverse text sources.

Blazing Speed

With mypyc compilation, it's 47x faster than chardet 6.0.0 and 1.5x faster than charset-normalizer, handling 551 files per second for efficient bulk processing.

Integrated Language Detection

Provides language detection with 95.7% accuracy across 49 languages, returned with every encoding result, aiding internationalization workflows without extra tools.

MIME Type and Streaming Support

Detects 40+ binary file formats via magic numbers and supports incremental detection for large files or network streams using UniversalDetector, enhancing versatility.

Cons

Higher Memory Consumption

Peak memory usage is 52.9 MiB, nearly double that of chardet 6.0.0, which could strain memory-intensive applications or embedded systems.

Optimization Dependency

Best performance requires mypyc compilation for a 1.67x speedup, adding a build step that may complicate deployment or not be feasible in all environments.

Python Version Limitation

Only supports Python 3.10 and above, excluding legacy projects on older versions that might need encoding detection without upgrading.

Frequently Asked Questions

Related Projects

ftfy

Fixes mojibake and other glitches in Unicode text, after the fact.

Stars4,043

Forks126

Last commit1 year ago

python-phonenumbers

Python port of Google's libphonenumber

Stars3,747

Forks442

Last commit3 days ago

textdistance

📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

Stars3,534

Forks258