A collection of genomic language models for predicting variant effects and evolutionary constraints from DNA sequences.
GPN (Genomic Pre-trained Network) is a collection of deep learning models that apply natural language processing techniques to genomic DNA sequences. It treats DNA as a language to predict the functional impact of genetic variants, model evolutionary constraints, and understand gene regulation across multiple species. The framework includes several specialized architectures for different genomic analysis tasks.
Computational biologists, bioinformaticians, and genomics researchers who need to predict variant effects, analyze evolutionary conservation, or build custom genomic language models for specific organisms.
GPN provides state-of-the-art genomic language models with multiple specialized architectures, extensive pre-trained models for various species, and a complete framework for training custom models on new genomic data. It's published in top-tier journals and integrates seamlessly with the HuggingFace ecosystem.
Genomic Pre-trained Network
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.
Includes GPN (single-sequence), GPN-MSA, PhyloGPN, and GPN-Star variants for different tasks like evolutionary modeling or alignment-free analysis, as detailed in the Modeling frameworks table.
All models are available on HuggingFace Model Hub, allowing easy loading with transformers.AutoModel and access to benchmark datasets, as shown in the quick start examples.
Offers pre-trained models for multiple species (human, mouse, fly, plants) with published benchmarks on clinical datasets like ClinVar and COSMIC, ensuring reliability.
Provides workflows for fine-tuning on custom genomic data, exemplified by the sorghum gene expression prediction model and training instructions with Snakemake.
Models like GPN-Star require whole-genome alignments for training and inference, which are large, difficult to obtain, and involve specialized preprocessing steps not trivial for all organisms.
Training and inference commands use torchrun with multiple GPUs, bf16 precision, and large batch sizes, making it resource-intensive and unsuitable for environments without robust hardware.
The README points to GitHub issues and discussions for training on non-standard species, indicating gaps in comprehensive guides for custom applications beyond the provided examples.
GPN-MSA is marked as deprecated in favor of GPN-Star, which could disrupt workflows for users invested in the older architecture and require migration efforts.