Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.

GPN (Genomic Pre-trained Network)

MIT · Jupyter Notebook · 0.7

A collection of genomic language models for predicting variant effects and evolutionary constraints from DNA sequences.

347 stars · 48 forks · 0 contributors

What is GPN (Genomic Pre-trained Network)?

GPN (Genomic Pre-trained Network) is a collection of deep learning models that apply natural language processing techniques to genomic DNA sequences. It treats DNA as a language to predict the functional impact of genetic variants, model evolutionary constraints, and understand gene regulation across multiple species. The framework includes several specialized architectures for different genomic analysis tasks.
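One common way a masked DNA language model scores a variant is a log-likelihood ratio: mask the variant position, read off the predicted probabilities for the four nucleotides, and compare the alternate allele to the reference. The sketch below assumes those probabilities have already been obtained from a model. The function name `llr_score` and the example numbers are hypothetical and are not taken from the GPN codebase.

```python
import math

NUCLEOTIDES = ("A", "C", "G", "T")


def llr_score(probs, ref, alt):
    """Log-likelihood ratio log(p_alt / p_ref) at a masked position.

    probs: dict mapping each nucleotide to the model's predicted probability.
    A negative score means the model finds the alternate allele less likely
    than the reference, which is usually read as "more likely deleterious".
    """
    if ref not in NUCLEOTIDES or alt not in NUCLEOTIDES:
        raise ValueError("ref and alt must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt must differ")
    total = sum(probs[n] for n in NUCLEOTIDES)
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError("probabilities must sum to 1")
    return math.log(probs[alt]) - math.log(probs[ref])


# Hypothetical model output for one masked position (illustrative numbers only).
example_probs = {"A": 0.80, "C": 0.05, "G": 0.10, "T": 0.05}
score = llr_score(example_probs, ref="A", alt="G")
print(f"LLR(A>G) = {score:.3f}")
```

Swapping `ref` and `alt` flips the sign of the score, so the same function covers both directions of a substitution.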

Target Audience

Computational biologists, bioinformaticians, and genomics researchers who need to predict variant effects, analyze evolutionary conservation, or build custom genomic language models for specific organisms.

Value Proposition

GPN offers several specialized genomic language model architectures, pre-trained models for a range of species, and a framework for training custom models on new genomic data. The models are described in journal publications and are distributed through the HuggingFace ecosystem.

Use Cases

Best For

  • Predicting pathogenicity of missense variants in human genetics
  • Analyzing evolutionary conservation across vertebrate species
  • Fine-tuning language models for specific plant or animal genomes
  • Predicting gene expression levels from DNA sequence
  • Identifying functional non-coding regulatory elements
  • Benchmarking variant effect prediction methods on clinical datasets

Not Ideal For

  • Projects requiring quick, out-of-the-box variant analysis without genomic alignment data preparation
  • Teams with limited GPU resources or no access to high-performance computing infrastructure
  • Applications focused solely on protein sequence analysis or non-DNA biological data
  • Clinical pipelines needing FDA-approved or extensively validated tools for immediate diagnostic use

Pros & Cons

Pros

Multiple Specialized Architectures

Includes GPN (single-sequence), GPN-MSA, PhyloGPN, and GPN-Star variants for different tasks like evolutionary modeling or alignment-free analysis, as detailed in the Modeling frameworks table.

Seamless HuggingFace Integration

All models are available on HuggingFace Model Hub, allowing easy loading with transformers.AutoModel and access to benchmark datasets, as shown in the quick start examples.
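A minimal sketch of what loading from the Model Hub might look like is given here. The model identifier `songlab/gpn-brassicales`, the `import gpn.model` registration step, and the helper name `load_gpn` are assumptions for illustration; check the project's README and the Model Hub for the current names before relying on them.

```python
DEFAULT_MODEL_ID = "songlab/gpn-brassicales"  # assumed identifier; verify on the Hub


def load_gpn(model_id=DEFAULT_MODEL_ID):
    """Load a tokenizer and model from the HuggingFace Model Hub.

    Imports are lazy so that defining this helper does not require the
    libraries to be installed. Calling it downloads weights on first use.
    """
    # Assumption: importing gpn.model registers GPN's custom architecture
    # with the transformers Auto* classes.
    import gpn.model  # noqa: F401
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model
```

Calling `tokenizer, model = load_gpn()` needs network access and the `gpn` and `transformers` packages installed.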

Extensive Pre-trained Models

Offers pre-trained models for multiple species (human, mouse, fly, plants), with published benchmarks on clinical datasets such as ClinVar and COSMIC.

Transfer Learning Support

Provides workflows for fine-tuning on custom genomic data, exemplified by the sorghum gene expression prediction model and training instructions with Snakemake.

Cons

Complex Data Dependency

Models like GPN-Star require whole-genome alignments for training and inference. These alignments are large, difficult to obtain, and need specialized preprocessing that is nontrivial for many organisms.

High Computational Overhead

Training and inference commands use torchrun with multiple GPUs, bf16 precision, and large batch sizes, making it resource-intensive and unsuitable for environments without robust hardware.
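In data-parallel training, the global batch size is the per-device batch size times the number of GPUs times the number of gradient accumulation steps. The sketch below uses that arithmetic to estimate how many accumulation steps a smaller machine would need to match a target global batch. The numbers are hypothetical and not taken from the GPN README, and the source does not state whether GPN's training scripts expose such a setting.

```python
import math


def grad_accum_steps(target_global_batch, per_device_batch, num_gpus):
    """Accumulation steps needed to reach at least the target global batch."""
    if min(target_global_batch, per_device_batch, num_gpus) < 1:
        raise ValueError("all arguments must be positive integers")
    per_step = per_device_batch * num_gpus  # samples processed per forward/backward pass
    return math.ceil(target_global_batch / per_step)


# Hypothetical target: a global batch of 2048 with 64 samples per device.
for gpus in (8, 4, 1):
    steps = grad_accum_steps(target_global_batch=2048, per_device_batch=64, num_gpus=gpus)
    print(f"{gpus} GPU(s): {steps} accumulation step(s)")
```

Fewer GPUs means more accumulation steps, so memory requirements drop while wall-clock time per optimizer update rises.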

Incomplete Documentation for Edge Cases

The README points to GitHub issues and discussions for training on non-standard species, indicating gaps in comprehensive guides for custom applications beyond the provided examples.

Deprecated Model Confusion

GPN-MSA is marked as deprecated in favor of GPN-Star, which could disrupt workflows for users invested in the older architecture and require migration efforts.

Quick Stats

Stars: 347
Forks: 48
Contributors: 0
Open issues: 3
Last commit: 26 days ago
Created: 2022

Tags

#variant-effect-prediction #transformer-models #deep-learning #dna-sequencing #language-model #genomics #dna #evolutionary-biology #bioinformatics #huggingface

Built With

  • Weights & Biases
  • transformers
  • HuggingFace
  • Python
  • PyTorch

Included in

Computational Biology · 122

Related Projects

Evo

Biological foundation modeling from molecular to genome scale

Stars: 1,524
Forks: 178
Last commit: 3 months ago

Nucleotide Transformer

Foundation Models for Genomics & Transcriptomics

Stars: 891
Forks: 95
Last commit: 4 months ago

HyenaDNA

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena

Stars: 793
Forks: 107
Last commit: 1 year ago

DNABERT

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Stars: 761
Forks: 179
Last commit: 5 months ago