A multilingual command-line sentence tokenizer written in Go, ported from NLTK's Punkt system.
Sentences is a Go library and command-line tool for sentence tokenization, that is, splitting text into individual sentences. It solves the problem of accurately identifying sentence boundaries across multiple languages, especially in cases involving abbreviations or ambiguous punctuation. The tool is based on the Punkt unsupervised learning algorithm and is ported from NLTK's implementation of it.
It is aimed at developers and researchers working with text processing, NLP pipelines, or multilingual applications who need reliable sentence segmentation without heavy dependencies.
It offers a fast, dependency-free implementation of a proven algorithm with support for 13 languages, making it a lightweight yet accurate alternative to heavier NLP libraries.
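To make the abbreviation problem concrete, here is a minimal, standard-library-only sketch in Go. It is not the Sentences API and not the Punkt algorithm: the function names `splitSentences` and `lastWord` and the hand-written abbreviation set are hypothetical, chosen only for illustration. Punkt learns its abbreviations from training text rather than taking a fixed list, so treat this as a simplified stand-in for the idea.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitSentences is a hypothetical helper, not part of the Sentences library.
// It ends a sentence at '.', '!' or '?' when the mark is followed by
// whitespace or the end of the text, unless a period follows a word listed
// in abbrevs (keys are lowercase, without the trailing period).
func splitSentences(text string, abbrevs map[string]bool) []string {
	var out []string
	runes := []rune(text)
	start := 0
	for i, r := range runes {
		if r != '.' && r != '!' && r != '?' {
			continue
		}
		// A boundary candidate must be followed by whitespace or end of text.
		if i+1 < len(runes) && !unicode.IsSpace(runes[i+1]) {
			continue
		}
		// Skip periods that terminate a known abbreviation.
		if r == '.' && abbrevs[strings.ToLower(lastWord(runes[start:i]))] {
			continue
		}
		if s := strings.TrimSpace(string(runes[start : i+1])); s != "" {
			out = append(out, s)
		}
		start = i + 1
	}
	// Keep any trailing text that has no closing punctuation.
	if rest := strings.TrimSpace(string(runes[start:])); rest != "" {
		out = append(out, rest)
	}
	return out
}

// lastWord returns the final whitespace-separated word in rs, or "".
func lastWord(rs []rune) string {
	fields := strings.Fields(string(rs))
	if len(fields) == 0 {
		return ""
	}
	return fields[len(fields)-1]
}

func main() {
	text := "Dr. Smith arrived at 5 p.m. yesterday. He sat down."
	// Assumed abbreviation list; Punkt would derive this from training data.
	abbrevs := map[string]bool{"dr": true, "p.m": true}

	fmt.Println("naive:", len(splitSentences(text, nil)), "sentences")
	for _, s := range splitSentences(text, abbrevs) {
		fmt.Println(s)
	}
}
```

With the abbreviation set, the sample text splits into two sentences; with a nil set, the same text is cut into four fragments. A fixed list like this also misses sentences that genuinely end in an abbreviation, which is one reason a trained model is preferable to hand-written rules.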
Supports 13 languages using the Punkt algorithm, which trains on text to accurately handle abbreviations and collocations, as highlighted in the README's performance comparison.
Written in Go with no external libraries, making it lightweight and easy to integrate into projects without dependency bloat, as emphasized in the features.
Benchmarked faster than NLTK on the Brown Corpus, with an average runtime of 1.96 seconds over 10 runs, while maintaining competitive accuracy.
Core components are composable, allowing customization for specific languages or rules, as demonstrated in the 'Customize' section of the README.
The author admits to not testing languages other than English, so reliability for the other supported languages is unverified and may depend on user contributions.
Requires loading JSON training files from the repo, adding an extra step compared to libraries with built-in models, as shown in the usage example.
On the Brown Corpus, accuracy is 98.95% versus NLTK's 99.21%, a gap of 0.26 percentage points that could be a drawback for precision-sensitive applications.