Open-Awesome


chardet

0BSD · Python · 7.4.3

A Python library that automatically detects the character encoding of text files and byte streams with high accuracy and speed.

GitHub: 2.6k stars · 300 forks · 0 contributors

What is chardet?

chardet is a Python library that automatically detects the character encoding of text files and byte streams. It addresses the problem of unknown text encodings by analyzing byte patterns to determine whether data is UTF-8, Windows-1252, EUC-JP, or another of its 99 supported encodings. This helps prevent garbled text and decoding errors in applications that process text from diverse sources.
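A minimal sketch of the detect-then-decode flow described above. It assumes chardet's long-standing `chardet.detect()` function, which takes bytes and returns a dict with an `encoding` key (possibly `None`). The helper name `decode_unknown`, its `fallback` parameter, and the injectable `detect` argument are illustrative and not part of chardet.

```python
from typing import Callable, Optional


def decode_unknown(
    raw: bytes,
    detect: Optional[Callable[[bytes], dict]] = None,
    fallback: str = "utf-8",
) -> str:
    """Decode bytes of unknown encoding using a detector's best guess."""
    if detect is None:
        # Imported lazily so the helper can also be driven by a stub detector.
        import chardet

        detect = chardet.detect

    result = detect(raw)
    # The detector may report None when it cannot identify an encoding.
    encoding = result.get("encoding") or fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: decode leniently instead.
        return raw.decode(fallback, errors="replace")
```

Typical use would be `text = decode_unknown(path.read_bytes())` for a `pathlib.Path`. Falling back to a lenient UTF-8 decode is one design choice among several; a stricter pipeline might prefer to raise and quarantine the file.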

Target Audience

Python developers working with text processing, data ingestion pipelines, web scraping, internationalization, or any application that handles files from unknown or multiple encoding sources.

Value Proposition

Developers choose chardet for its 99.3% detection accuracy, a 47x speedup over chardet 6.0.0, built-in language detection, and a backward-compatible API that makes it a drop-in replacement while adding newer features such as MIME type detection and encoding filters.

Overview

Python character encoding detector

Use Cases

Best For

  • Processing text files with unknown character encodings in data pipelines
  • Web scraping applications that encounter pages with various encodings
  • Internationalization workflows requiring language detection alongside encoding detection
  • Legacy system integration where file encodings are undocumented or inconsistent
  • Building tools that need to distinguish between text files and binary formats
  • Handling user-uploaded files with unpredictable encoding in web applications

Not Ideal For

  • Applications requiring real-time encoding detection with sub-millisecond latency
  • Projects running on memory-constrained systems where 52.9 MiB peak usage is prohibitive
  • Environments stuck on Python versions older than 3.10
  • Situations where the exact encoding is always known and detection overhead is unnecessary

Pros & Cons

Pros

Exceptional Accuracy

Achieves 99.3% accuracy on 2,517 test files, significantly outperforming chardet 6.0.0 (88.2%) and charset-normalizer (85.4%), making it highly reliable for diverse text sources.

Blazing Speed

With mypyc compilation, it's 47x faster than chardet 6.0.0 and 1.5x faster than charset-normalizer, handling 551 files per second for efficient bulk processing.

Integrated Language Detection

Provides language detection with 95.7% accuracy across 49 languages, returned with every encoding result, aiding internationalization workflows without extra tools.
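As a rough illustration of reading the language alongside the encoding, the sketch below formats a detection result. It assumes the result is a dict with `encoding`, `confidence`, and `language` keys, as in chardet's classic `detect()` output; the helper name `describe_result`, the `min_confidence` threshold, and the sample values in the usage note are made up for illustration.

```python
def describe_result(result: dict, min_confidence: float = 0.5) -> str:
    """Summarize a detection result dict as a short human-readable string."""
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # Treat a missing or empty language field as unknown.
    language = result.get("language") or "unknown"

    # Treat missing or low-confidence guesses as undetermined.
    if encoding is None or confidence < min_confidence:
        return "undetermined"
    return f"{encoding} ({language}, confidence {confidence:.2f})"
```

For a hypothetical result such as `{"encoding": "EUC-JP", "confidence": 0.99, "language": "Japanese"}`, this returns `EUC-JP (Japanese, confidence 0.99)`. The exact language strings chardet reports are not specified here, so check them against real output before branching on them.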

MIME Type and Streaming Support

Detects 40+ binary file formats via magic numbers and supports incremental detection for large files or network streams using UniversalDetector, enhancing versatility.
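Incremental detection can be sketched as follows. This assumes the classic `UniversalDetector` interface (`feed()`, `done`, `close()`, `result`) and the `chardet.universaldetector` import path, both of which are expected to carry over given the backward-compatible API but should be checked against the installed version. The helper name `detect_stream` and its injectable `detector` argument are illustrative.

```python
from typing import Iterable


def detect_stream(chunks: Iterable[bytes], detector=None) -> dict:
    """Feed byte chunks to a detector until it is confident or input ends."""
    if detector is None:
        # Imported lazily; the module path is assumed from earlier releases.
        from chardet.universaldetector import UniversalDetector

        detector = UniversalDetector()

    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:
            # The detector has seen enough; skip the rest of the stream.
            break
    # close() finalizes the guess based on everything fed so far.
    detector.close()
    return detector.result
```

For a large file, chunks can be produced lazily, for example `detect_stream(iter(lambda: fh.read(65536), b""))` with `fh` opened in binary mode, so the whole file never has to sit in memory.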

Cons

Higher Memory Consumption

Peak memory usage is 52.9 MiB, nearly double that of chardet 6.0.0, which could strain memory-constrained applications or embedded systems.

Optimization Dependency

Best performance requires mypyc compilation for a 1.67x speedup, adding a build step that may complicate deployment or not be feasible in all environments.

Python Version Limitation

Only supports Python 3.10 and above, excluding legacy projects on older versions that might need encoding detection without upgrading.

Quick Stats

Stars: 2,633
Forks: 300
Contributors: 0
Open Issues: 0
Last commit: 1 month ago
Created: 2012

Tags

#unicode #python-library #text-analysis #text-processing #character-encoding #language-detection #mime-type-detection

Built With

  • Mypyc
  • Python

Included in

Python · 290.8k

Related Projects

ftfy

Fixes mojibake and other glitches in Unicode text, after the fact.

Stars: 4,043
Forks: 126
Last commit: 1 year ago

python-phonenumbers

Python port of Google's libphonenumber

Stars: 3,747
Forks: 442
Last commit: 3 days ago

textdistance

📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

Stars: 3,534
Forks: 258
Last commit: 1 year ago

shortuuid

A generator library for concise, unambiguous and URL-safe UUIDs.

Stars: 2,188
Forks: 115
Last commit: 6 months ago