A Ruby natural language processor for tokenizing and analyzing text with flexible filtering and custom regex support.
WordsCounted is a Ruby natural language processing library that tokenizes text and reports statistics such as word frequencies, densities, and lengths. It spares developers from writing their own tokenizer by offering configurable tokenization with custom regex patterns and exclusion filters.
It is aimed at Ruby developers working on text analysis, linguistic processing, content mining, or NLP applications who need programmatic control over tokenization and text statistics.
Developers choose WordsCounted for its flexible tokenization options, built-in analytics, and correct handling of Unicode characters, all in a lightweight Ruby gem.
Supports custom regex patterns and exclusion filters using strings, regexps, lambdas, symbols, or arrays, allowing precise control over token extraction as shown in the README examples.
Out-of-the-box analysis includes token counts, frequencies, densities, lengths, character counts, and identification of longest/most frequent tokens, providing detailed insights from text data.
Correctly processes diacritics and special characters, treating words like 'Bayrūt' as single tokens instead of splitting them, ensuring accurate multilingual text analysis.
Reads text directly from files or URLs using the from_file method, simplifying data ingestion without manual pre-processing for basic use cases.
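The tokenization, exclusion, and statistics ideas described above can be sketched in plain Ruby. This is a minimal illustration that does not call the words_counted gem: `TOKEN_PATTERN`, `tokenise`, `token_frequency`, and `token_density` are hypothetical names chosen for this example, and the pattern is an assumption (runs of Unicode letters, hyphens, and apostrophes).

```ruby
# Plain-Ruby sketch of the concepts; this is NOT the gem's implementation.

# Assumed pattern: one or more Unicode letters, hyphens, or apostrophes.
# \p{Alpha} matches letters with diacritics, so "Bayrūt" stays one token.
TOKEN_PATTERN = /[\p{Alpha}\-']+/

# Hypothetical helper: lowercase the text, extract tokens with the pattern,
# and drop every token listed in `exclude`.
def tokenise(text, pattern: TOKEN_PATTERN, exclude: [])
  text.downcase.scan(pattern).reject { |token| exclude.include?(token) }
end

# Hypothetical helper: map each token to the number of times it occurs.
def token_frequency(tokens)
  tokens.each_with_object(Hash.new(0)) { |token, counts| counts[token] += 1 }
end

# Hypothetical helper: map each token to its share of all tokens (0.0..1.0).
def token_density(tokens)
  total = tokens.size.to_f
  token_frequency(tokens).transform_values { |count| (count / total).round(2) }
end

tokens = tokenise("Bayrūt is the capital; Bayrūt is old.", exclude: %w[is the])
frequency = token_frequency(tokens)
density = token_density(tokens)
```

Passing a different `pattern:` (for example `/\d+/` to extract only digit runs) changes what counts as a token without touching the statistics helpers, which is the separation the gem's options are described as providing.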
Focuses solely on tokenization and basic statistics, lacking features for more complex NLP tasks like part-of-speech tagging or semantic analysis, which may require additional libraries.
The README admits gotchas where hyphens used as dashes cause incorrect tokenization (e.g., '-you' treated as a separate token), requiring manual regex adjustments to fix.
The roadmap lists 'Ability to open URLs' as a future feature, which suggests that URL input is limited today (for example, it may not strip HTML or handle advanced URL parsing), reducing its out-of-the-box utility.
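The hyphen gotcha mentioned above, and one possible regex adjustment, can be reproduced in plain Ruby. This sketch does not use the gem; `LOOSE_PATTERN` and `STRICT_PATTERN` are hypothetical names, and the loose pattern is an assumed stand-in for a tokenizer that accepts hyphens anywhere in a token.

```ruby
# Plain-Ruby sketch; patterns and names are assumptions for illustration.

# Assumed loose pattern: a hyphen may appear anywhere, including at the start.
LOOSE_PATTERN = /[\p{Alpha}\-']+/

# Stricter alternative: a hyphen or apostrophe is accepted only between letters,
# so a hyphen standing in for a dash is no longer glued to the next word.
STRICT_PATTERN = /\p{Alpha}+(?:[\-']\p{Alpha}+)*/

text = "How do you do?-you are well-known, I see."

loose_tokens  = text.downcase.scan(LOOSE_PATTERN)
strict_tokens = text.downcase.scan(STRICT_PATTERN)

# loose_tokens contains "-you"; strict_tokens contains "you" instead,
# while the genuinely hyphenated "well-known" survives in both.
```

The trade-off is that the stricter pattern also drops leading or trailing apostrophes, which may or may not be desirable depending on the text.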