Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.


gensim

LGPL-2.1 · Python · 4.4.0

A Python library for topic modeling, document indexing, and similarity retrieval with large text corpora.

16.4k stars · 4.4k forks · 0 contributors

What is gensim?

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large text corpora. It helps users discover hidden semantic structures in documents using algorithms like Latent Dirichlet Allocation (LDA) and word2vec. It is built to analyze massive text collections efficiently, and its algorithms are memory-independent with respect to corpus size: the input is streamed rather than loaded into RAM in full.

Target Audience

Natural language processing (NLP) and information retrieval (IR) practitioners, researchers, and developers who need to process large text datasets for topic extraction, document similarity, or semantic analysis.

Value Proposition

Developers choose Gensim for its memory-efficient processing of corpora larger than RAM, intuitive APIs, and optimized implementations of popular algorithms. Its ability to handle streaming data and its support for distributed computing make it well suited to large-scale text analysis.

Overview

Topic Modelling for Humans

Use Cases

Best For

  • Extracting topics from large document collections like research papers or news articles
  • Building document similarity search engines for content recommendation
  • Processing text corpora that exceed available RAM with streaming algorithms
  • Implementing word2vec models for semantic word embeddings
  • Running distributed topic modeling across computer clusters
  • Analyzing customer feedback or reviews to identify common themes

Not Ideal For

  • Applications requiring real-time, low-latency text analysis
  • Projects that need the latest transformer-based models like BERT or GPT for contextual embeddings
  • Teams looking for an all-in-one NLP library with built-in preprocessing, tokenization, and linguistic tools

Pros & Cons

Pros

Memory-Independent Processing

Uses Python generators and iterators for streamed data processing, allowing corpora larger than RAM to be handled efficiently, as highlighted in the design goals.
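
The streaming idea can be sketched without gensim itself: a corpus is any object that yields one document at a time when iterated, so only a single document is held in memory at once. The class name `StreamedCorpus`, the one-document-per-line file layout, and the whitespace tokenizer below are assumptions for illustration, not part of gensim's API.

```python
import os
import tempfile
from collections import Counter


class StreamedCorpus:
    """Yield one bag-of-words vector per line of a text file (hypothetical helper)."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # repeatedly; only the current line is held in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(line.lower().split())
                yield sorted(
                    (self.token2id[tok], n)
                    for tok, n in counts.items()
                    if tok in self.token2id
                )


# Write a tiny corpus to disk to stand in for a file larger than RAM.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("graph trees graph\nhuman computer interface\n")
    corpus_path = tmp.name

token2id = {"graph": 0, "trees": 1, "human": 2, "computer": 3}
corpus = StreamedCorpus(corpus_path, token2id)
vectors = list(corpus)  # materialized here only to inspect the toy output
os.remove(corpus_path)
```

An iterable shaped like this, yielding lists of `(token_id, count)` pairs, matches the bag-of-words form that gensim models accept as a corpus.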

Optimized Multicore Performance

Leverages BLAS libraries via NumPy for fast implementations of algorithms like LDA and word2vec, with multithreading support for improved speed on multicore systems.

Distributed Computing Capabilities

Supports running LSA and LDA on computer clusters, enabling scalable parallel processing for large-scale text analysis, as mentioned in the features.

Extensive Learning Resources

Provides comprehensive tutorials and Jupyter Notebook examples, making it easier for users to learn and apply topic modeling techniques effectively.

Cons

No New Feature Development

The project is in stable maintenance mode: only bug and documentation fixes are accepted, so it won't incorporate the latest NLP advancements or new algorithms.

Complex Setup Requirements

Installation depends on NumPy linked to a BLAS library, which can be non-trivial to configure, especially when building from source on platforms without pre-built wheels.

Limited Model Variety

Focuses primarily on unsupervised topic modeling and embeddings, lacking integrated support for supervised NLP tasks or modern neural network architectures beyond word2vec.

Quick Stats

Stars: 16,397
Forks: 4,412
Contributors: 0
Open Issues: 399
Last commit: 5 months ago
Created: 2011

Tags

#information-retrieval #python-library #text-analysis #data-science #natural-language-processing #python #topic-modeling #machine-learning #nlp #data-mining

Built With

  • BLAS
  • Fortran
  • Python
  • NumPy
  • C++

Links & Resources

Website

Included in

Machine Learning · 72.2k
Auto-fetched 1 day ago

Related Projects

PyTorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Stars: 99,362
Forks: 27,568
Last commit: 1 day ago

keras

Deep Learning for humans

Stars: 64,026
Forks: 19,761
Last commit: 1 day ago

streamlit

Streamlit — A faster way to build and share data apps.

Stars: 44,318
Forks: 4,213
Last commit: 1 day ago

gradio

Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

Stars: 42,407
Forks: 3,409
Last commit: 1 day ago