A C++ and Python library for efficient extraction and analysis of n-grams, skipgrams, and flexgrams from large corpora.
Colibri Core is an NLP tool and library for extracting and analyzing linguistic patterns like n-grams, skipgrams, and flexgrams from large text corpora efficiently. It solves the problem of high memory usage in traditional pattern extraction by using compressed representations and intelligent counting algorithms. The core tool, colibri-patternmodeller, allows users to build, query, and manipulate pattern models with various statistical insights.
Colibri Core is aimed at computational linguists, NLP researchers, and data scientists who work with large text datasets and need efficient pattern extraction and analysis. It is also suitable for developers building linguistic analysis tools or integrating pattern modeling into applications.
Developers choose Colibri Core for its memory-efficient design, which enables processing of large corpora without prohibitive resource demands. Its support for advanced pattern types (skipgrams and flexgrams) and indexed models for detailed statistics offers more insights than basic n-gram tools.
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool colibri-patternmodeller, which allows you to build, view, manipulate and query pattern models.
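To make the pattern types concrete, here is a minimal pure-Python sketch of n-gram extraction and of fixed-size skipgrams with a single gap. It does not use Colibri Core and does not reflect its internals; the function names and the `GAP` marker are placeholders chosen for this illustration only.

```python
# Illustrative only: plain Python, not the Colibri Core API.
GAP = "{*}"  # arbitrary placeholder for one skipped token (assumption of this sketch)


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def single_gap_skipgrams(tokens, n):
    """Return skipgrams of total length n that contain exactly one gap.

    The gap covers one or more consecutive interior positions; the first
    and last token of each pattern are always kept.
    """
    result = []
    for gram in ngrams(tokens, n):
        for start in range(1, n - 1):          # first skipped position
            for end in range(start + 1, n):    # one past the last skipped position
                result.append(gram[:start] + (GAP,) * (end - start) + gram[end:])
    return result


tokens = "to be or not".split()
bigrams = ngrams(tokens, 2)
skips3 = single_gap_skipgrams(tokens, 3)
skips4 = single_gap_skipgrams(tokens, 4)
```

With the four-token input above, `skips3` holds one pattern per trigram (the middle token replaced by the gap marker), and `skips4` holds the three ways of placing a single gap inside the one 4-gram. A flexgram would differ in that its gap has no fixed width, which this sketch does not model.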
Uses a compressed binary representation in which frequent word classes take less space, reducing memory and disk usage, as described among the optimization techniques in the README.
Implements informed iterative counting to discard patterns below thresholds early, speeding up processing for large corpora without unnecessary resource consumption.
Handles n-grams, skipgrams, and flexgrams with variable gaps, enabling detailed linguistic analysis beyond basic extraction, as described in the pattern categories.
Offers standalone command-line tools, a C++ library, and a Python library, providing flexibility for integration into diverse workflows and systems.
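The two optimizations listed above, frequency-ranked word classes and threshold-based iterative counting, can be sketched in a few lines of plain Python. This is a simplified illustration of the general ideas, not Colibri Core's actual implementation or API; `build_class_encoding` and `count_ngrams_iteratively` are hypothetical names, and the sketch covers n-grams only.

```python
# Illustrative only: plain Python, not the Colibri Core API.
from collections import Counter


def build_class_encoding(tokens):
    """Map each word to an integer class, most frequent word first.

    Small integers fit in fewer bytes under a variable-length encoding,
    so frequent words end up cheaper to store. Ties break alphabetically.
    """
    freq = Counter(tokens)
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    return {word: cls for cls, word in enumerate(ranked, start=1)}


def count_ngrams_iteratively(tokens, max_n, min_count):
    """Count n-grams for n = 1..max_n, keeping those seen >= min_count times.

    An n-gram can only reach the threshold if both of its (n-1)-gram
    sub-patterns did, so candidates failing that check are never counted.
    """
    kept = {}
    previous = None  # surviving (n-1)-grams from the last pass
    for n in range(1, max_n + 1):
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if previous is not None and (
                gram[:-1] not in previous or gram[1:] not in previous
            ):
                continue  # a sub-pattern was already discarded
            counts[gram] += 1
        previous = {g for g, c in counts.items() if c >= min_count}
        if not previous:
            break  # nothing longer can reach the threshold
        kept.update({g: counts[g] for g in previous})
    return kept


words = "to be or not to be".split()
encoding = build_class_encoding(words)
encoded = [encoding[w] for w in words]
model = count_ngrams_iteratively(encoded, max_n=3, min_count=2)
```

Counting over the integer classes rather than the original strings keeps the keys compact, and the early discard means the trigram pass in this example does no counting at all, because only one bigram survived the threshold.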
The Python binding is not available for Windows, restricting its use in Windows-only environments despite availability on Unix-like systems and via containers.
Requires compilation from source with multiple dependencies if pre-built packages are unavailable, which can be daunting for non-expert users, as noted in the installation section.
Concentrates solely on pattern extraction and analysis, lacking built-in support for other common NLP tasks like tokenization or classification, making it less versatile for broader pipelines.