State-of-the-art pre-trained transformer language models for protein sequences, enabling tasks like structure prediction and function annotation.
ProtTrans is a suite of pre-trained transformer-based language models specifically designed for protein sequences. It treats amino acid sequences as a language, enabling models to learn rich representations that capture structural and functional properties. These embeddings can be used for a wide range of bioinformatics tasks, such as predicting protein structure, function, and interactions.
Bioinformaticians, computational biologists, and machine learning researchers working on protein-related problems who need high-quality embeddings or want to fine-tune models for specific prediction tasks.
ProtTrans offers state-of-the-art performance on key benchmarks, provides a variety of model architectures and sizes, and is fully integrated with the Hugging Face ecosystem for easy use. Its models are openly available and have been validated in numerous downstream applications, from variant effect prediction to protein design.
ProtTrans provides state-of-the-art pre-trained language models for proteins. The models were trained on thousands of GPUs on the Summit supercomputer and hundreds of Google TPUs, using transformer architectures.
Benchmarks in the README show ProtT5-XL-UniRef50 achieving up to 87% Q3 accuracy on secondary structure prediction and outperforming models such as ESM on several downstream tasks.
Offers multiple transformer variants (e.g., ProtT5, ProtBERT, ProtAlbert) trained on different datasets (UniRef50, BFD), providing flexibility for specific research needs.
Models are directly accessible via the Transformers library, simplifying installation and embedding extraction with standard Python code, as shown in the Quick Start section (a sketch follows this list).
Supports a wide range of tasks including feature extraction, fine-tuning with LoRA, prediction, sequence generation, and visualization, covering key bioinformatics workflows (a LoRA sketch is included below).
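A minimal sketch of the embedding-extraction pattern from the Quick Start, using the encoder-only half-precision ProtT5-XL checkpoint named in the README (Rostlab/prot_t5_xl_half_uniref50-enc); treat it as illustrative rather than the repository's exact script:

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Encoder-only ProtT5-XL checkpoint; other variants (e.g. Rostlab/prot_bert)
# follow the same pattern with their matching tokenizer/model classes.
name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(name).to(device).eval()
if device == "cuda":
    model = model.half()  # half precision is GPU-only; CPU falls back to fp32 (much slower)

seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]
# ProtT5 expects space-separated residues, with rare amino acids mapped to X.
seqs = [" ".join(re.sub(r"[UZOB]", "X", s)) for s in seqs]

batch = tokenizer(seqs, padding=True, return_tensors="pt").to(device)
with torch.no_grad():
    emb = model(**batch).last_hidden_state  # (batch, seq_len, 1024) per-residue embeddings
```

Per-protein embeddings are typically obtained by mean-pooling the per-residue vectors over the sequence length.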
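For the LoRA fine-tuning workflow, a hypothetical sketch using the peft library; the rank, alpha, and target-module names here are illustrative assumptions, not the repository's published recipe:

```python
from transformers import T5EncoderModel
from peft import LoraConfig, get_peft_model

base = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc")

# Hypothetical LoRA configuration: inject low-rank adapters into the
# query/value projections of the T5 attention blocks (named "q"/"v" in HF T5).
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, target_modules=["q", "v"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```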
Models like ProtT5-XL require GPUs for efficient inference; the README explicitly states CPU usage is 'much slower' and not recommended, limiting accessibility for resource-constrained setups.
Sections for fine-tuning, visualization, and benchmarking note 'More information coming soon,' leaving gaps for users trying to implement advanced features without external guidance.
The README warns of tokenizer changes in Hugging Face that require workarounds (e.g., installing protobuf or setting legacy=True), adding complexity to setup and maintenance (see the snippet after this list).
Exclusively designed for protein sequences, so it cannot be applied to other biological data types like genomics or metabolomics without significant adaptation.
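The tokenizer workaround noted above amounts to installing protobuf (pip install protobuf) and/or passing legacy=True when loading the tokenizer; whether it is needed depends on your transformers version:

```python
from transformers import T5Tokenizer

# Restores the older SentencePiece tokenization behavior the ProtTrans
# models were trained with; may also require `pip install protobuf`.
tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_half_uniref50-enc",
    do_lower_case=False,
    legacy=True,
)
```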
AlphaFold 3 inference pipeline.
Evolutionary Scale Modeling (esm): Pretrained language models for proteins
Official repository for the Boltz biomolecular interaction models
Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2