A benchmark for evaluating protein language models through five biologically relevant semi-supervised learning tasks.
TAPE (Tasks Assessing Protein Embeddings) is a benchmark and toolkit for evaluating protein language models. It provides a set of five biologically relevant downstream tasks—such as secondary structure prediction and contact prediction—to assess how well learned protein embeddings capture functional and structural information. The project includes pretrained models, datasets, and training/evaluation code to standardize comparisons in protein representation learning.
Bioinformatics researchers and machine learning scientists working on protein sequence modeling who need to benchmark their models against standardized biological tasks.
TAPE offers a unified, extensible framework for evaluating protein embeddings across multiple biological domains, with pretrained models and curated datasets that reduce implementation overhead. Its focus on biologically meaningful tasks makes it more relevant for real-world applications than generic language modeling benchmarks.
Tasks Assessing Protein Embeddings (TAPE) is a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.
Uses a HuggingFace-style API for seamless loading of pretrained models such as ProteinBERT and UniRep, with automatic downloading and caching to simplify the workflow.
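Per the project README, the real package pairs calls like `ProteinBertModel.from_pretrained('bert-base')` with `TAPETokenizer(vocab='iupac')`. The self-contained sketch below illustrates only the encoding step such a tokenizer performs; the vocabulary and token names here are illustrative, not TAPE's actual mapping.

```python
# Sketch of the encoding step a TAPE-style tokenizer performs.
# The vocabulary below is an illustrative IUPAC-like amino-acid vocab;
# the real TAPETokenizer ships its own mapping and special tokens.

SPECIALS = ["<pad>", "<mask>", "<cls>", "<sep>", "<unk>"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}

def encode(sequence):
    """Wrap a protein sequence in <cls>/<sep> and map residues to ids."""
    tokens = ["<cls>", *sequence, "<sep>"]
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]

ids = encode("GCTVEDRCLIG")
print(ids[0], ids[-1])  # first id is <cls>, last is <sep>
```

The resulting id tensor is what would be fed to a pretrained model's forward pass, mirroring the `tokenizer.encode` → `model(token_ids)` pattern familiar from HuggingFace.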
Offers five standardized downstream tasks—secondary structure prediction, contact prediction, remote homology detection, fluorescence landscape prediction, and stability landscape prediction—providing a holistic evaluation of protein embeddings.
Designed for easy addition of new models and tasks, with examples in the repository to guide community contributions and adaptations.
Includes LMDB and raw JSON formats for all tasks and pretraining data, reducing data preprocessing time and ensuring consistency.
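The raw JSON variant can be parsed with nothing but the standard library. The field names below (`primary` for the amino-acid sequence, `log_fluorescence` for the label) are illustrative of the fluorescence task's convention; exact keys vary by task in the actual datasets.

```python
import json

# One record in a TAPE-style raw JSON format. Field names are
# illustrative: 'primary' holds the amino-acid sequence and
# 'log_fluorescence' the regression label.
record_json = '{"primary": "GCTVEDRCLIG", "log_fluorescence": [3.72]}'

record = json.loads(record_json)
sequence = record["primary"]
label = record["log_fluorescence"][0]
print(len(sequence), label)
```

The LMDB copies of the same data are preferable for large-scale training, since records can be read lazily without loading the whole file into memory.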
The README explicitly warns against using TAPE's built-in training utilities, which have not been updated for newer PyTorch versions; users are directed to external frameworks such as PyTorch Lightning instead.
Some documentation is missing, with users directed to open issues for clarification, which can slow down onboarding and troubleshooting.
This PyTorch version is not fully compatible with the original TensorFlow code, so directly reproducing the paper's results requires extra effort or falling back to the original repository.