Catalyst is a high-performance C# NLP library inspired by spaCy, offering pre-trained models, entity recognition, and embedding training.
Catalyst is a high-performance Natural Language Processing library for C# and .NET, designed to bring spaCy-like capabilities to the .NET ecosystem. It provides pre-trained models, fast tokenization, entity recognition, and tools for training word and document embeddings, enabling developers to integrate advanced NLP into their applications efficiently.
Catalyst is aimed at .NET developers and data scientists who need to perform text analysis, entity extraction, or language understanding tasks within C# applications, particularly those looking for a performant alternative to Python-based NLP libraries.
Catalyst offers a pure-C# implementation with cross-platform support, exceptional speed (over 1M tokens/s), and a comprehensive feature set including pre-trained models and embedding training, making it a compelling choice for .NET teams needing robust NLP without external dependencies.
🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
Tokenizes over 1 million tokens per second on a modern CPU while relying very little on regular expressions, according to the figures in the README.
Targets .NET Standard 2.0 and runs on Windows, Linux, macOS, and ARM systems, ensuring broad deployment options.
Supports gazetteer-based, rule-based, and perceptron-based models for named entity extraction, offering multiple approaches as detailed in the features.
Provides out-of-the-box tools for training FastText and StarSpace embeddings, simplifying custom model development without external dependencies.
Serializes models to a fast binary format with MessagePack and lazy-loads pre-trained models from NuGet packages, reducing overhead in model management.
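To illustrate the gazetteer-based option mentioned above, here is a minimal sketch that adds a term list to an English pipeline and prints the matches. It assumes the `Catalyst` and `Catalyst.Models.English` NuGet packages are installed. The type and member names (`Spotter`, `AddEntry`, `GetEntities`) are recalled from the project's published samples and should be checked against the installed version; the tag `"programming"`, the entity type `"ProgrammingLanguage"`, and the input sentence are made up for the example.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;

public static class GazetteerSketch
{
    public static async Task Main()
    {
        // Register the English models and choose a local folder for model storage.
        Catalyst.Models.English.Register();
        Storage.Current = new DiskStorage("catalyst-models");

        var nlp = await Pipeline.ForAsync(Language.English);

        // Gazetteer-style recognizer: tags any term from the list below
        // with the (hypothetical) entity type "ProgrammingLanguage".
        var spotter = new Spotter(Language.Any, 0, "programming", "ProgrammingLanguage");
        spotter.Data.IgnoreCase = true;
        spotter.AddEntry("C#");
        spotter.AddEntry("Python");
        nlp.Add(spotter);

        var doc = new Document("I write most of my code in C# and some in Python.", Language.English);
        nlp.ProcessSingle(doc);

        // Each span (sentence) exposes the entities found inside it.
        foreach (var entity in doc.SelectMany(span => span.GetEntities()))
        {
            Console.WriteLine($"{entity.Value} [{entity.EntityType.Type}]");
        }
    }
}
```

A term list like this suits closed vocabularies such as product names; the rule-based and perceptron-based models are the alternatives when the entities cannot be enumerated in advance.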
The README notes that pre-trained embedding models are 'coming soon,' forcing users to train their own, which can be resource-intensive.
Relies on Universal Dependencies for pre-trained models, which may not cover all niche languages or domains compared to competitors like spaCy.
Requires pre-registering languages and installing separate NuGet packages for each, adding complexity to initial configuration.
As a newer .NET-focused library, it has a smaller user base than established Python NLP libraries, so there are fewer tutorials, community answers, and third-party extensions to draw on.
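The setup cost noted above (pre-registering each language and installing its model package) looks roughly like the following in practice. This is a sketch based on the usage shown in the project's README, assuming the `Catalyst` and `Catalyst.Models.English` NuGet packages are installed; the storage folder name is arbitrary.

```csharp
using System;
using System.Threading.Tasks;
using Catalyst;
using Mosaik.Core;

public static class SetupSketch
{
    public static async Task Main()
    {
        // Each language must be registered before use; the models come
        // from a separate NuGet package (here Catalyst.Models.English).
        Catalyst.Models.English.Register();

        // Keep model files in a local folder (the name is an arbitrary choice).
        Storage.Current = new DiskStorage("catalyst-models");

        // Build the default English pipeline and process one document.
        var nlp = await Pipeline.ForAsync(Language.English);
        var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
        nlp.ProcessSingle(doc);

        // Dump the tokens and annotations as JSON for inspection.
        Console.WriteLine(doc.ToJson());
    }
}
```

Supporting another language would mean installing its model package and adding the matching `Register()` call before building a pipeline for it.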
Catalyst is an open-source alternative to the following products: