An unsupervised text tokenizer and detokenizer for neural network-based text generation systems with subword units.
A minimalistic, single-header JSON tokenizer/parser in C for resource-limited and embedded systems.
A fast, feature-rich parser-building toolkit for JavaScript, supporting LL(K) and LL(*) grammars.
A high-performance, browser-grade HTML5 parser written in Rust, developed as part of the Servo project.
A self-contained Japanese morphological analyzer written in pure Go, tokenizing text into words and analyzing parts of speech.
A Swift library for tokenizing strings using character sets and custom tokenizers when whitespace splitting is insufficient.
A comprehensive suite of Java NLP libraries and tools for text annotation, feature extraction, and language processing tasks.
A multilingual command-line sentence tokenizer written in Go, ported from NLTK's Punkt system.
A Rust implementation of OpenAI's tiktoken tokenizer for working with GPT models and token counting.
A Bayesian text classifier for Go with flexible tokenizers and storage backends.
A high-performance, regex-free Go tokenizer for parsing strings, slices, and infinite streams into customizable tokens.