A discrete diffusion framework for controllable generation of protein sequences and evolutionary alignments.
EvoDiff is an open-source framework that generates protein sequences and evolutionary alignments with discrete diffusion models. It enables controllable protein design by combining evolutionary-scale data with diffusion-based generation, so researchers can create novel proteins with desired functional or structural properties. Because it operates directly in sequence space, it avoids a key limitation of structure-based models and can design proteins with intrinsically disordered regions.
Computational biologists, protein engineers, and bioinformatics researchers who need to generate novel protein sequences or design proteins with specific functional motifs, disordered regions, or evolutionary constraints.
EvoDiff provides a flexible, sequence-first approach to protein design that goes beyond traditional structure-based methods, offering both unconditional and conditional generation capabilities. Its unique value lies in leveraging evolutionary information through MSAs and enabling the design of proteins with intrinsically disordered regions, which are inaccessible to most existing protein design tools.
Generation of protein sequences and evolutionary alignments via discrete diffusion models
Conditions on multiple sequence alignments (MSAs) and functional motifs for targeted generation, enabling design of proteins with specific evolutionary or functional properties as shown in conditional tasks like scaffolding and IDR inpainting.
Generates proteins directly in sequence space, allowing creation of intrinsically disordered regions (IDRs) inaccessible to structure-based models, expanding design beyond the structure-function paradigm.
Offers multiple diffusion schemes (OADM, D3PM) and model sizes (38M and 640M parameters), with sequence models trained on UniRef50 and alignment models trained on OpenFold, so users can match the model to their task and compute budget.
Includes analysis scripts for metrics such as self-consistency and RMSD, using tools such as OmegaFold and ESM-IF, which supports rigorous assessment of generated sequences.
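To make the order-agnostic (OADM) scheme and motif conditioning described above more concrete, here is a minimal toy sketch in Python. It is not EvoDiff's code or API: `toy_predict` is a hypothetical stand-in that draws residues uniformly at random where a trained denoising network would be used, and `order_agnostic_sample` and its `motif` argument are names invented for this illustration. The sketch only shows the decoding pattern: start from a fully masked sequence, optionally pin motif residues, then fill the remaining positions one at a time in random order.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # placeholder symbol for a not-yet-generated position


def toy_predict(tokens, position, rng):
    """Hypothetical stand-in for a trained denoiser.

    A real model would predict a residue distribution from the partially
    unmasked sequence; this toy ignores the context and samples uniformly.
    """
    return rng.choice(AMINO_ACIDS)


def order_agnostic_sample(length, motif=None, seed=0):
    """Fill a masked sequence one position at a time, in random order.

    motif: optional dict {position: residue}; these positions are fixed
    up front and never resampled (a toy version of motif conditioning).
    """
    rng = random.Random(seed)
    tokens = [MASK] * length
    for pos, residue in (motif or {}).items():
        tokens[pos] = residue

    # Decode the remaining masked positions in a random order.
    masked = [i for i, tok in enumerate(tokens) if tok == MASK]
    rng.shuffle(masked)
    for pos in masked:
        tokens[pos] = toy_predict(tokens, pos, rng)
    return "".join(tokens)


unconditional = order_agnostic_sample(30)
scaffolded = order_agnostic_sample(30, motif={10: "H", 11: "E", 12: "H"})
print(unconditional)
print(scaffolded)
```

In this sketch the motif residues at positions 10 to 12 survive unchanged in the conditioned sample, while every other position is generated. Swapping `toy_predict` for a learned model is what would turn the pattern into an actual generator.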
Requires specific Python versions, a PyTorch installation, and dependencies such as torch-scatter, plus manual downloads of datasets (e.g., UniRef50, OpenFold), which can cause setup and compatibility problems.
Evaluation relies on third-party tools (e.g., TM-score, ProteinMPNN) that need separate installation and setup, adding significant complexity to the workflow beyond core generation.
Multiple model types and parameters require careful tuning, and the documentation assumes prior knowledge of diffusion models and bioinformatics, making it less accessible for newcomers.