A Python library for Chinese text segmentation, offering multiple modes, custom dictionaries, and keyword extraction.
Jieba is a Python library for Chinese text segmentation: it splits running Chinese text, which uses no spaces between words, into individual words. Solving this word-boundary-detection problem makes Jieba essential for text analysis, search indexing, and other NLP tasks.
Developers and researchers working with Chinese text processing, including those in natural language processing, search engine development, data mining, and academic research.
Developers choose Jieba for its high accuracy, multiple segmentation modes, ease of use, and extensive features like custom dictionaries and keyword extraction, all under an MIT license.
"Jieba" (结巴, Chinese for "to stutter"): Chinese text segmentation
Offers four modes—precise, full, search engine, and paddle—allowing developers to trade off accuracy and speed for different use cases, as shown in the code examples for text analysis and search indexing.
Supports custom dictionaries and dynamic word frequency adjustment via methods like add_word and suggest_freq, enabling domain-specific tuning to improve segmentation accuracy, detailed in the user dictionary section.
Includes keyword extraction using TF-IDF and TextRank algorithms, plus part-of-speech tagging compatible with the ICTCLAS tag set, reducing the need for additional libraries in text processing pipelines.
Implements parallel processing for faster segmentation on large texts (though not on Windows) and efficient prefix dictionary scanning, achieving speeds up to 1.5 MB/s in full mode per the benchmark.
Parallel processing is not supported on Windows, which significantly hampers performance gains for teams using that OS for large-scale text analysis, as noted in the README.
The paddle mode relies on an old version of PaddlePaddle (v1.6.1), which may conflict with newer installations and lacks updates, limiting its usefulness for cutting-edge neural segmentation tasks.
The README presents Chinese documentation first with English below, which can inconvenience non-Chinese speakers despite translations, potentially slowing onboarding for international teams.
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models spanning text, vision, audio, and multimodal tasks, for both inference and training.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems.
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.