scBERT

GPL-3.0 · Python · v1.0.0

A BERT-based foundation model pretrained on large-scale scRNA-seq data for automated cell type annotation in single-cell analysis.

357 stars · 69 forks · 0 contributors

What is scBERT?

scBERT is a BERT-based foundation model pretrained on large-scale single-cell RNA sequencing (scRNA-seq) data for automated cell type annotation. It addresses common challenges in scRNA-seq analysis, such as batch effects and reliance on curated marker gene lists, by leveraging deep learning to capture gene-gene interactions. The model follows a pre-train and fine-tune approach, enabling accurate annotation on user-specific datasets.

Target Audience

Bioinformaticians, computational biologists, and researchers working with single-cell RNA sequencing data who need reliable, scalable cell type annotation tools. It is suited for those familiar with deep learning frameworks and Python-based bioinformatics workflows.

Value Proposition

scBERT offers a pretrained deep learning model built specifically for scRNA-seq data. It aims to improve annotation accuracy over traditional methods by handling batch effects and exploiting latent gene-gene interactions, without requiring extensive manual curation of marker genes.

Overview

scBERT is a deep learning model designed to address the challenges of cell type annotation in single-cell RNA sequencing (scRNA-seq) data. It leverages the pre-train and fine-tune paradigm to overcome issues like batch effects, reliance on curated marker genes, and inefficient use of gene-gene interaction information.

Key Features

  • BERT-based Architecture — Uses a transformer encoder (PerformerLM) pretrained on massive unlabeled scRNA-seq data to understand gene-gene interactions.
  • Pre-train and Fine-tune — Pretrained on large-scale data for general understanding, then fine-tuned on specific datasets for accurate cell annotation.
  • Novel Cell Type Detection — Includes functionality to detect novel cell types by thresholding predicted probabilities.
  • Batch Effect Handling — Designed to better manage batch effects compared to traditional annotation algorithms.
  • Scalable Inference — Can infer cell types for thousands of cells efficiently (e.g., ~25 minutes for 10,000 cells on a desktop).

Philosophy

scBERT applies the success of large-scale pretrained language models to computational biology, aiming to provide a robust, data-driven foundation for cell type annotation that reduces reliance on manually curated knowledge.

Use Cases

Best For

  • Automating cell type annotation in large-scale single-cell RNA-seq studies
  • Reducing reliance on manually curated marker gene lists for cell identification
  • Handling batch effects in multi-experiment scRNA-seq datasets
  • Detecting novel or rare cell types in single-cell data
  • Applying transformer-based deep learning to computational biology tasks
  • Fine-tuning pretrained models for specific scRNA-seq annotation projects

Not Ideal For

  • Researchers with small datasets or tight deadlines requiring quick, lightweight annotation tools
  • Teams lacking deep learning expertise or familiarity with PyTorch and bioinformatics pipelines
  • Clinical or medical projects needing validated, approved tools for diagnostic use
  • Users who prioritize interpretable, transparent algorithms over deep learning black boxes

Pros & Cons

Pros

Advanced Batch Effect Handling

scBERT is specifically designed to better manage batch effects compared to traditional annotation algorithms, as stated in the README, making it robust for multi-experiment datasets.

Novel Cell Type Detection

Includes built-in functionality to detect novel cell types by thresholding predicted probabilities, with a default threshold of 0.5, providing flexibility for exploratory analysis.
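
The thresholding idea can be illustrated with a minimal sketch. It assumes the classifier emits one raw score per known cell type, that scores are converted to probabilities with a softmax, and that a cell whose top probability falls below the threshold is flagged as potentially novel. The function names, the label set, and the "Unassigned" marker are hypothetical and are not taken from the scBERT code.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def annotate(logits, labels, threshold=0.5, novel_label="Unassigned"):
    """Return (label, probability); use novel_label when confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        # No known type is convincing enough: flag as potentially novel.
        return novel_label, probs[best]
    return labels[best], probs[best]

labels = ["T cell", "B cell", "Monocyte"]    # hypothetical label set
confident = annotate([4.0, 0.5, 0.2], labels)  # one score clearly dominates
ambiguous = annotate([1.0, 0.9, 0.8], labels)  # scores are nearly tied
print(confident[0], "|", ambiguous[0])
```

Raising the threshold flags more cells as potentially novel at the cost of leaving more known cells unlabeled, so the value is worth tuning per dataset.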

Scalable Inference Performance

Can efficiently infer cell types for thousands of cells, with the README citing ~25 minutes for 10,000 cells on a desktop, enabling large-scale studies.
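
As a rough planning aid, the cited figure works out to about 0.15 seconds per cell. The sketch below extrapolates from it under the assumption that inference time scales linearly with cell count on comparable hardware, which this page does not confirm; the function name is hypothetical.

```python
def estimate_minutes(n_cells, ref_cells=10_000, ref_minutes=25.0):
    """Linearly scale the cited reference timing to n_cells (an assumption)."""
    return n_cells * ref_minutes / ref_cells

# Per-cell cost implied by ~25 minutes for 10,000 cells.
seconds_per_cell = 25.0 * 60 / 10_000

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} cells: ~{estimate_minutes(n):.1f} min")
```

Actual run time depends on hardware, batch size, and the number of genes per cell, so treat these numbers as order-of-magnitude estimates only.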

Data-Driven Annotation Approach

Reduces reliance on manually curated marker genes by leveraging pretrained models on massive unlabeled scRNA-seq data, aligning with modern AI paradigms for improved accuracy.

Cons

Cumbersome Data Preprocessing

Requires specific steps like gene symbol revision according to NCBI Gene database and normalization with scanpy, adding complexity and potential for errors in the workflow.
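
To make the normalization step concrete, here is a dependency-free sketch of what library-size normalization followed by a log transform does to one cell's counts. In scanpy this corresponds to calling `sc.pp.normalize_total` and then `sc.pp.log1p`. The target sum of 10,000 and the base-2 logarithm are assumptions for illustration and should be checked against the project's own preprocessing script.

```python
import math

def normalize_and_log(counts, target_sum=1e4, base=2):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros rather than dividing by zero.
        return [0.0 for _ in counts]
    scaled = [c * target_sum / total for c in counts]
    return [math.log1p(x) / math.log(base) for x in scaled]

cell = [0, 3, 1, 6]  # hypothetical raw counts for four genes
transformed = normalize_and_log(cell)
print([round(v, 3) for v in transformed])
```

Because each cell is rescaled to the same total before the log transform, cells sequenced at different depths become comparable, which is the point of this step.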

Outdated Software Dependencies

Depends on older library versions such as torch 1.8.1, which may lead to compatibility issues, security vulnerabilities, and lack of access to newer features.
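
Before running the pipeline, it can help to confirm whether the local environment matches the pinned version. The sketch below uses only the standard library; the helper name is hypothetical, and stripping a local build suffix such as "+cu111" before comparing is an assumption about how the installed version string may look.

```python
from importlib import metadata

def check_pin(package, pinned):
    """Return (installed_version_or_None, matches_pin) for a package."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None, False
    # Ignore a local build suffix (for example "+cu111") when comparing.
    return installed, installed.split("+")[0] == pinned

installed, ok = check_pin("torch", "1.8.1")
print(f"torch installed: {installed}, matches 1.8.1: {ok}")
```

An isolated virtual environment or container is usually the safest way to satisfy an old pin without disturbing other projects.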

No Clinical Validation

The project's disclaimer explicitly states that scBERT is not approved for clinical use, which limits its applicability in medical research and diagnostic settings.

Quick Stats

Stars: 357
Forks: 69
Contributors: 0
Open Issues: 24
Last commit: 2 years ago
Created: 2021

Tags

#transformer #cell-type-annotation #deep-learning #single-cell-rna-seq #computational-biology #gene-expression #pretrained-models #bioinformatics

Built With

  • Scanpy
  • transformers
  • scikit-learn
  • pandas
  • Python
  • NumPy
  • PyTorch
  • SciPy

Included in

Computational Biology (122)

Related Projects

totalVI

Deep probabilistic analysis of single-cell and spatial omics data

Stars: 1,652
Forks: 466
Last commit: 1 day ago

scGPT

scGPT is a foundation model designed for single-cell multi-omics data analysis using generative AI. It leverages transformer architecture pretrained on millions of single-cell profiles to enable a wide range of downstream biological tasks, advancing computational biology by providing a powerful, unified model for cellular data.

Key Features:

  • Pretrained Model Zoo: Offers multiple organ-specific and whole-human models trained on millions of cells for various applications.
  • Zero-Shot Applications: Supports tasks like cell embedding and reference mapping without task-specific training.
  • Reference Mapping: Enables fast similarity search across millions of cells using efficient indexing with faiss.
  • Multi-Task Fine-Tuning: Can be adapted for scRNA-seq integration, cell type annotation, perturbation prediction, and GRN inference.
  • Online Tools: Provides accessible web applications for reference mapping, cell annotation, and GRN inference via cloud GPUs.

Philosophy: scGPT aims to build a foundational AI model for single-cell biology, democratizing access to advanced computational methods and accelerating discoveries in multi-omics research through open-source collaboration.

Stars: 1,592
Forks: 335
Last commit: 2 months ago

UNI

Pathology Foundation Model - Nature Medicine

Stars: 752
Forks: 87
Last commit: 1 year ago

GigaPath

Prov-GigaPath: A whole-slide foundation model for digital pathology from real-world data

Stars: 621
Forks: 104
Last commit: 1 year ago