A library that removes common Unicode confusables, homoglyphs, and diacritics from strings to normalize text.
decancer normalizes strings by removing common Unicode confusables, homoglyphs, and diacritics. It addresses text obfuscation used for spam, filter evasion, or visual spoofing by reducing input to a clean, readable form. The library is built for performance and accuracy and is available in multiple programming languages.
It is aimed at developers building applications that need text sanitization, such as chat platforms, content moderation systems, or security tools that must handle user-generated input safely.
Unlike other text filtering libraries, decancer is aware of Unicode bidirectional text and handles a wide range of confusables efficiently. Its Rust core is built for speed, and its multi-language bindings make it accessible across different tech stacks.
The core is written in Rust and uses binary search for lookups, which keeps processing fast while filtering 222,557 Unicode codepoints, according to the README.
Correctly interprets right-to-left characters as rendered, unlike other packages, improving accuracy in security and moderation contexts.
Filters homoglyphs, diacritics, leetspeak, Zalgo text, and emojis, covering a wide range of evasion techniques.
Available in Rust, JavaScript, Java, C/C++, Go, and Python, making it accessible across diverse tech stacks for integration.
Highly configurable behavior allows tailoring to specific needs, as emphasized in the philosophy and examples.
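The binary-search lookup mentioned in the list above can be illustrated with a minimal sketch. This is not decancer's implementation or API: the `CONFUSABLES` table, `lookup`, and `normalize` are hypothetical names, and the six-entry table stands in for the much larger one the library ships. It only shows why a sorted codepoint table makes each lookup cheap.

```python
from bisect import bisect_left

# Hypothetical, tiny confusables table (an assumption for illustration).
# Each entry maps one codepoint to an ASCII replacement; sorting by
# codepoint is what makes binary search possible.
CONFUSABLES = sorted([
    (0x0430, "a"),  # CYRILLIC SMALL LETTER A
    (0x0435, "e"),  # CYRILLIC SMALL LETTER IE
    (0x043E, "o"),  # CYRILLIC SMALL LETTER O
    (0x0440, "p"),  # CYRILLIC SMALL LETTER ER
    (0x0441, "c"),  # CYRILLIC SMALL LETTER ES
    (0xFF41, "a"),  # FULLWIDTH LATIN SMALL LETTER A
])
KEYS = [codepoint for codepoint, _ in CONFUSABLES]


def lookup(ch):
    """Return the replacement for ch, or ch itself if it is not in the table."""
    i = bisect_left(KEYS, ord(ch))  # O(log n) search over sorted codepoints
    if i < len(KEYS) and KEYS[i] == ord(ch):
        return CONFUSABLES[i][1]
    return ch


def normalize(text):
    """Replace every known confusable, then lowercase the result."""
    return "".join(lookup(ch) for ch in text).lower()


# "p" followed by Cyrillic a, es, ie looks like "pace" when rendered.
print(normalize("p\u0430\u0441\u0435"))
```

With a sorted table, each character costs a logarithmic number of comparisons regardless of how many codepoints are covered, which is one plausible reason a table of this size stays fast.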
Installation for Go requires Rust and elevated permissions, and C/C++ involves platform-specific downloads, adding deployment complexity.
Python bindings are listed as unofficial, which may mean less consistent updates or more limited maintenance than the officially supported bindings receive.
Output is a CuredString object that shouldn't be coerced to regular strings, complicating integration with existing string processing code.
Aggressive filtering of many codepoints might inadvertently remove legitimate characters in niche use cases, despite customizability.
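The trade-off in the last point can be seen with a small standard-library sketch. It uses Python's `unicodedata` module rather than decancer, so it is only an approximation of diacritic and Zalgo stripping; `strip_marks` and `strip_marks_except` are hypothetical helpers, and the allowlist is an assumed stand-in for the kind of customization the library offers.

```python
import unicodedata


def strip_marks(text):
    """Decompose text (NFKD) and drop all combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_marks_except(text, keep):
    """Like strip_marks, but leave precomposed characters in `keep` untouched."""
    return "".join(ch if ch in keep else strip_marks(ch) for ch in text)


# Stacked combining marks (Zalgo-style) are removed as intended.
print(strip_marks("t\u0300\u0301e\u0302"))

# Legitimate accents are removed too: "café" loses its accent.
print(strip_marks("caf\u00e9"))

# An allowlist keeps selected characters, here the precomposed n with tilde.
print(strip_marks_except("jalape\u00f1o caf\u00e9", {"\u00f1"}))
```

The same filter that cleans up obfuscated text also flattens legitimate accented words, so applications that handle names or non-English text may want an allowlist or a narrower configuration.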