A Python library for Chinese text segmentation, offering multiple modes, custom dictionaries, and keyword extraction.
Jieba is a Python library for Chinese text segmentation: it splits running Chinese text, which uses no spaces between words, into individual words. Solving this word-boundary-detection problem makes Jieba essential for text analysis, search indexing, and other NLP tasks.
Developers and researchers working with Chinese text processing, including those in natural language processing, search engine development, data mining, and academic research.
Developers choose Jieba for its high accuracy, multiple segmentation modes, ease of use, and extensive features like custom dictionaries and keyword extraction, all under an MIT license.
"Jieba" (结巴, Chinese for "to stutter"): Chinese text segmentation
Offers four modes—precise, full, search engine, and paddle—allowing developers to trade off accuracy and speed for different use cases, as shown in the code examples for text analysis and search indexing.
Supports custom dictionaries and dynamic word frequency adjustment via methods like add_word and suggest_freq, enabling domain-specific tuning to improve segmentation accuracy, detailed in the user dictionary section.
Includes keyword extraction using TF-IDF and TextRank algorithms, plus part-of-speech tagging compatible with the ICTCLAS tag set, reducing the need for additional libraries in text processing pipelines.
Implements parallel processing for faster segmentation on large texts (though not on Windows) and efficient prefix dictionary scanning, achieving speeds up to 1.5 MB/s in full mode per the benchmark.
Parallel processing is not supported on Windows, which significantly hampers performance gains for teams using that OS for large-scale text analysis, as noted in the README.
The paddle mode relies on an old version of PaddlePaddle (v1.6.1), which may conflict with newer installations and lacks updates, limiting its usefulness for cutting-edge neural segmentation tasks.
The README presents Chinese documentation first with English below, which can inconvenience non-Chinese speakers despite translations, potentially slowing onboarding for international teams.
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models spanning text, vision, audio, and multimodal tasks, for both inference and training.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems.
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.