A rule-based sentence boundary detection gem for Ruby that works out-of-the-box across many languages.
Pragmatic Segmenter is a Ruby library for sentence boundary detection, the task of splitting text into individual sentences. It resolves ambiguous punctuation, such as periods in abbreviations or numbers, by applying language-specific rules to decide where each sentence begins and ends. It is designed for practical, real-world text processing and does not require machine learning models.
It is aimed at developers and researchers working on natural language processing, translation tools, or text analysis pipelines who need reliable sentence segmentation in multiple languages. It is particularly useful for those building translation memory systems or processing multilingual documents.
Developers choose Pragmatic Segmenter for its high accuracy on edge cases, its broad language support, and a rule-based design that works without training data. It scores higher than many other segmentation tools on the Golden Rules test set and is tuned for translation workflows, which makes it a ready-to-use option for multilingual text.
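As a rough illustration of the rule-based idea described above, the sketch below protects periods inside known abbreviations and decimal numbers before splitting on sentence-ending punctuation. It is a toy written for this page, not the gem's actual implementation: `ABBREVIATIONS`, `PLACEHOLDER`, and `toy_segment` are hypothetical names, and the real library relies on much larger language-specific rule sets.

```ruby
# Toy rule-based splitter (illustrative only; not Pragmatic Segmenter's code).
ABBREVIATIONS = %w[mr mrs dr prof e.g i.e].freeze # hypothetical, deliberately tiny
PLACEHOLDER = "\u0001".freeze                     # stands in for a protected period

def toy_segment(text)
  protected_text = text.dup

  # Rule 1: a period that closes a known abbreviation is not a boundary.
  ABBREVIATIONS.each do |abbr|
    pattern = /\b#{Regexp.escape(abbr)}\./i
    protected_text.gsub!(pattern) { |match| match.tr('.', PLACEHOLDER) }
  end

  # Rule 2: a period between two digits (as in 3.84) is not a boundary.
  protected_text.gsub!(/(?<=\d)\.(?=\d)/, PLACEHOLDER)

  # Split after ., ! or ? when whitespace follows, then restore the periods.
  protected_text
    .split(/(?<=[.!?])\s+/)
    .map { |sentence| sentence.tr(PLACEHOLDER, '.').strip }
    .reject(&:empty?)
end
```

Calling `toy_segment("Dr. Smith paid 3.84 dollars. He left at noon!")` returns two sentences rather than splitting after "Dr." or inside "3.84". A fixed abbreviation list like this one is also where such an approach falls short: any abbreviation missing from the list is treated as a sentence end.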
Achieves 98.08% on the English Golden Rules tests and supports over a dozen languages with language-specific punctuation handling, outperforming tools like NLTK's Punkt in accuracy benchmarks.
Designed with translation memory in mind, it conservatively keeps parentheticals and quotations as single segments to preserve coherence, which suits NLP pipelines in translation tools.
Includes preprocessing for PDF line breaks, HTML, and table of contents artifacts, reducing the need for external cleaning steps in document processing workflows.
Lets you turn off the built-in cleaning step and specify a document type (such as PDF). By default it takes a conservative approach and avoids splitting on ambiguous boundaries.
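The options mentioned above can be sketched as follows, using the constructor keywords shown in the gem's README (`text:`, `language:`, `doc_type:`, `clean:`). The two wrapper method names are hypothetical and exist only for this example, and the gem must be installed (for instance with `gem install pragmatic_segmenter`) before either method is called.

```ruby
# Hypothetical helper names; the keyword arguments follow the gem's README.

# Segment text extracted from a PDF, letting the gem handle hard line breaks.
def segment_pdf_text(text, language: 'en')
  require 'pragmatic_segmenter' # required lazily so this file loads without the gem
  PragmaticSegmenter::Segmenter
    .new(text: text, language: language, doc_type: 'pdf')
    .segment
end

# Segment text that is already clean, skipping the built-in preprocessing.
def segment_without_cleaning(text, language: 'en')
  require 'pragmatic_segmenter'
  PragmaticSegmenter::Segmenter
    .new(text: text, language: language, clean: false)
    .segment
end
```

Both methods return an array of sentence strings. Skipping the cleaning step is mainly useful when the input has already been normalized upstream and the original characters need to be preserved.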
Because it is purely rule-based, it cannot adapt to new edge cases without manual rule updates. Golden Rule #18 (a.m./p.m. with capitalized abbreviations) remains unsolved, which shows where the approach reaches its limits.
In speed benchmarks, it averages 3.84 s on the test data, slower than alternatives like Scalpel (0.13 s), which may matter for high-throughput applications.
Some languages, such as Thai, are explicitly listed as TODO, and adding a new language requires writing extensive abbreviation lists and rules, which limits out-of-the-box usability for unsupported languages.