Open-Awesome

jieba

MIT · Python · v0.42.1

A Python library for Chinese text segmentation, offering multiple modes, custom dictionaries, and keyword extraction.

GitHub: 34.9k stars · 6.7k forks · 0 contributors

What is jieba?

Jieba is a Python library for Chinese text segmentation, which splits Chinese text into individual words. It solves the problem of word boundary detection in Chinese, where spaces are not used to separate words, making it essential for text analysis, search indexing, and NLP tasks.
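
As a minimal sketch of the word-boundary problem described above, the snippet below wraps `jieba.lcut`, which returns a list of words in the default (precise) mode. The helper name `segment` is hypothetical and only for illustration, and the import sits inside the function so the module still loads where jieba is not installed.

```python
def segment(text):
    """Split Chinese text into a list of words with jieba's default (precise) mode."""
    # Imported lazily (an illustrative choice) so this module loads without jieba.
    import jieba

    # lcut returns a list; jieba.cut would return a generator instead.
    return jieba.lcut(text)
```

Calling `segment("我来到北京清华大学")` should return a list of words whose concatenation is the original sentence, since precise mode assigns each character to exactly one word.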

Target Audience

Developers and researchers working with Chinese text processing, including those in natural language processing, search engine development, data mining, and academic research.

Value Proposition

Developers choose Jieba for its high accuracy, multiple segmentation modes, ease of use, and extensive features like custom dictionaries and keyword extraction, all under an MIT license.

Overview

Jieba Chinese word segmentation (结巴中文分词)

Use Cases

Best For

  • Building search engines that require Chinese text indexing
  • Performing text analysis on Chinese corpora for research
  • Extracting keywords from Chinese documents using TF-IDF or TextRank
  • Adding part-of-speech tagging to Chinese NLP pipelines
  • Processing large volumes of Chinese text with parallel segmentation
  • Integrating Chinese segmentation into Python-based applications

Not Ideal For

  • Applications requiring real-time, streaming text segmentation with sub-millisecond latency
  • Teams integrated with modern deep learning frameworks like TensorFlow or PyTorch seeking end-to-end neural segmentation models
  • Projects running on Windows that need parallel processing for speed optimization
  • Environments where multilingual text segmentation beyond Chinese is a primary requirement

Pros & Cons

Pros

Versatile Segmentation Modes

Offers four modes (precise, full, search engine, and paddle), letting developers trade accuracy against speed for use cases such as text analysis and search indexing, as the README's code examples show.
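
The sketch below compares three of these modes on the same input. It assumes only the documented `jieba.lcut` (with its `cut_all` flag) and `jieba.lcut_for_search` calls; the helper name `compare_modes` is hypothetical. Paddle mode is left out because it needs the separate PaddlePaddle dependency.

```python
def compare_modes(text):
    """Return the segmentation of `text` in precise, full, and search-engine mode."""
    # Imported lazily (an illustrative choice) so this module loads without jieba.
    import jieba

    return {
        # Precise mode: each character belongs to exactly one word.
        "precise": jieba.lcut(text, cut_all=False),
        # Full mode: every dictionary word found in the text, overlaps included.
        "full": jieba.lcut(text, cut_all=True),
        # Search-engine mode: precise mode plus shorter sub-words of long words.
        "search": jieba.lcut_for_search(text),
    }
```

Full and search-engine modes can emit overlapping words, so only the precise result can be joined back into the original text.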

Extensive Customization Options

Supports custom dictionaries and dynamic word frequency adjustment via methods like add_word and suggest_freq, enabling domain-specific tuning to improve segmentation accuracy, detailed in the user dictionary section.
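
A hedged sketch of this kind of tuning follows. It uses a separate `jieba.Tokenizer` instance so the global default tokenizer is left untouched; the helper names `make_domain_tokenizer` and `force_split` are hypothetical, and the example words are placeholders for real domain vocabulary.

```python
def make_domain_tokenizer(extra_words):
    """Build a tokenizer whose dictionary also contains the given domain words."""
    # Imported lazily (an illustrative choice) so this module loads without jieba.
    import jieba

    tokenizer = jieba.Tokenizer()  # independent of jieba's global tokenizer
    for word in extra_words:
        # With no explicit frequency, jieba picks one high enough for the
        # word to be kept whole when segmented on its own.
        tokenizer.add_word(word)
    return tokenizer


def force_split(tokenizer, left, right):
    """Lower the frequency of left+right so the two parts are cut separately."""
    # suggest_freq with a tuple and tune=True applies the suggested frequency
    # to the tokenizer and returns it.
    return tokenizer.suggest_freq((left, right), tune=True)
```

For example, `make_domain_tokenizer(["创新办"]).lcut("创新办")` should keep the added word as one token. Whether a word survives inside a longer sentence still depends on the frequencies of competing words around it.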

Built-in NLP Utilities

Includes keyword extraction using the TF-IDF and TextRank algorithms, plus part-of-speech tagging compatible with the ICTCLAS tag set, reducing the need for additional libraries in text processing pipelines.
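
These utilities can be sketched as two small wrappers, assuming the `jieba.analyse.extract_tags`, `jieba.analyse.textrank`, and `jieba.posseg.lcut` calls. The function names `top_keywords` and `tag_parts_of_speech` and the `method` parameter are hypothetical and introduced only for illustration.

```python
def top_keywords(text, top_k=5, method="tfidf"):
    """Return up to `top_k` keywords from `text` using TF-IDF or TextRank."""
    if method not in ("tfidf", "textrank"):
        raise ValueError("method must be 'tfidf' or 'textrank'")
    # Imported lazily (an illustrative choice) so this module loads without jieba.
    import jieba.analyse

    if method == "tfidf":
        return jieba.analyse.extract_tags(text, topK=top_k)
    return jieba.analyse.textrank(text, topK=top_k)


def tag_parts_of_speech(text):
    """Return (word, tag) tuples, e.g. a noun is tagged 'n' and a verb 'v'."""
    import jieba.posseg as pseg

    # Each item yielded by posseg exposes .word and .flag attributes.
    return [(pair.word, pair.flag) for pair in pseg.lcut(text)]
```

TextRank filters candidate words by part of speech by default, so on short inputs it may return fewer keywords than TF-IDF.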

Performance Optimizations

Implements parallel processing for faster segmentation of large texts (though not on Windows) and efficient prefix-dictionary scanning, reaching up to 1.5 MB/s in full mode according to the README's benchmark.

Cons

Windows Parallel Processing Limitation

Parallel processing is not supported on Windows, as noted in the README, so teams on that OS cannot use it to speed up large-scale text analysis.
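
One way to work around this limitation is to enable parallel mode only where it is available and fall back to serial segmentation elsewhere. The sketch below assumes jieba rejects parallel mode when `os.name` is `"nt"`; the helper names `parallel_supported` and `segment_large_text` and the default of 4 workers are hypothetical.

```python
import os


def parallel_supported(os_name=os.name):
    """Assumption: jieba's parallel mode is unavailable only on Windows ('nt')."""
    return os_name != "nt"


def segment_large_text(text, workers=4):
    """Segment multi-line text, using jieba's parallel mode where it is available."""
    # Imported lazily (an illustrative choice) so this module loads without jieba.
    import jieba

    use_parallel = parallel_supported()
    if use_parallel:
        jieba.enable_parallel(workers)  # input is split by line across workers
    try:
        # jieba.cut (not lcut) is the function that parallel mode replaces.
        return list(jieba.cut(text))
    finally:
        if use_parallel:
            jieba.disable_parallel()
```

Because the work is divided by line, parallel mode only helps on input with many lines, and the cost of starting worker processes can outweigh the gain on small texts.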

Outdated Deep Learning Dependency

The paddle mode relies on an old version of PaddlePaddle (v1.6.1), which may conflict with newer installations and lacks updates, limiting its usefulness for cutting-edge neural segmentation tasks.

Documentation Hierarchy Barrier

The README presents Chinese documentation first with English below, which can inconvenience non-Chinese speakers despite translations, potentially slowing onboarding for international teams.

Quick Stats

Stars: 34,920
Forks: 6,702
Contributors: 0
Open issues: 645
Last commit: 1 year ago
Created: 2012

Tags

#part-of-speech-tagging #python-library #multilingual-support #natural-language-processing #tokenization #text-segmentation #search-engine #keyword-extraction #chinese-nlp

Built With

Python

Included in

Machine Learning (72.2k)

Related Projects

HuggingFace Transformers

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Stars: 159,772
Forks: 32,981
Last commit: 1 day ago

spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Stars: 33,501
Forks: 4,676
Last commit: 27 days ago

Haystack

Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems.

Stars: 24,954
Forks: 2,731
Last commit: 2 days ago

Rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more. Create chatbots and voice assistants.

Stars: 21,135
Forks: 4,909
Last commit: 2 months ago