A Rust library for natural language detection using trigram models, focusing on simplicity and performance.
Whatlang is a Rust library for natural language detection that identifies the language and script of a text using trigram models. It addresses the problem of efficiently determining the language of user-generated content in applications such as search engines and text processing tools. The library supports 70 languages and reports reliability metrics that help judge how far a detection result can be trusted.
Whatlang is aimed at Rust developers building applications that require language identification, such as search engines, content moderation systems, or multilingual text processing tools. It is particularly useful for projects that prioritize performance and simplicity in NLP tasks.
Developers choose Whatlang for its focus on performance, simplicity, and reliability, backed by a pure Rust implementation. It offers a straightforward API, script detection, and confidence scoring, making it a lightweight alternative to heavier NLP libraries.
Natural language detection library for Rust. Try demo online: https://whatlang.org/
Detects 70 languages (listed in the SUPPORTED_LANGUAGES.md file) and distinguishes scripts such as Latin and Cyrillic, which covers most common use cases.
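Provides confidence scores and an is_reliable flag, derived from the number of unique trigrams in the text and the score difference between the top candidate languages. A calculated threshold function over these two factors helps assess how accurate a detection is likely to be.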
Written in Rust and backed by performance benchmarks, it is fast and lightweight enough for real-time applications. It is used in projects such as Meilisearch and Sonic.
Offers a straightforward detect function and feature toggles for serde and enum-map, reducing integration complexity and aligning with the philosophy of simplicity.
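The reliability idea described above (fewer unique trigrams require a larger gap between the best and second-best language) can be sketched in plain Rust. This is an illustration under stated assumptions, not whatlang's actual code: the function names `unique_trigrams`, `required_gap`, and `looks_reliable`, the shape of the threshold, and its constants are all hypothetical.

```rust
use std::collections::HashSet;

/// Collects the distinct character trigrams of `text`, lowercased.
fn unique_trigrams(text: &str) -> HashSet<[char; 3]> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    chars.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}

/// Hypothetical threshold: the fewer unique trigrams a text has,
/// the larger the relative score gap we demand.
/// The constants are illustrative only.
fn required_gap(unique_count: usize) -> f64 {
    2.0 / unique_count.max(1) as f64 + 0.05
}

/// Hypothetical reliability check comparing the two best language scores.
fn looks_reliable(unique_count: usize, best: f64, runner_up: f64) -> bool {
    if runner_up <= 0.0 {
        // No competing language at all: reliable if anything matched.
        return best > 0.0;
    }
    let gap = (best - runner_up) / runner_up;
    gap > required_gap(unique_count)
}

fn main() {
    let short = unique_trigrams("hi there");
    let long = unique_trigrams("the quick brown fox jumps over the lazy dog");
    println!("short text: {} unique trigrams", short.len());
    println!("long text: {} unique trigrams", long.len());

    // The same score gap passes for the long text but not necessarily
    // for the short one, because the short text sets a higher bar.
    println!("short reliable: {}", looks_reliable(short.len(), 1.0, 0.8));
    println!("long reliable: {}", looks_reliable(long.len(), 1.0, 0.8));
}
```

The point of the sketch is only the trade-off: with little text there is little evidence, so a narrow win between two candidate languages should not be trusted.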
Supports only 70 languages, fewer than alternatives such as CLD3 (107 languages), so niche languages may be missing and global coverage is limited.
Cannot parse HTML directly, unlike CLD2, requiring manual text extraction for web content and adding preprocessing overhead.
The trigram-based model may struggle with very short texts or mixed-language content, as reliability depends on text length and uniqueness, potentially reducing accuracy in edge cases.
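Because HTML is not handled, web content has to be reduced to plain text before detection. The following is a minimal sketch of that preprocessing step using only the standard library. The helper `strip_tags` is hypothetical and deliberately naive: it drops everything between `<` and `>`, does not decode entities, and does not skip the contents of script or style elements, so a real HTML parser is the safer choice for production input.

```rust
/// Removes anything between '<' and '>' and collapses runs of whitespace.
/// Naive by design: no entity decoding, no special handling of
/// comments or <script>/<style> bodies.
fn strip_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                // Replace the tag with a space so adjacent words stay apart.
                out.push(' ');
            }
            _ if !in_tag => out.push(ch),
            _ => {} // character inside a tag: skip it
        }
    }
    // Collapse whitespace introduced by removed tags.
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

fn main() {
    let html = "<p>Hello <b>world</b></p>";
    let text = strip_tags(html);
    println!("{}", text);
    // `text` would then be passed to the language detection call.
}
```

Stripping markup first also matters for the short-text limitation: tag names and attributes are mostly ASCII, so leaving them in would add Latin trigrams that have nothing to do with the language of the visible text.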