A biological foundation model for long-context DNA sequence modeling and design from single nucleotides to whole genomes.
Evo is a biological foundation model that understands and generates DNA sequences from the molecular to the genome scale. It treats DNA as a language, using the StripedHyena architecture to model sequences at single-nucleotide resolution. The model addresses the problem of computationally efficient, large-scale biological sequence design and analysis, enabling tasks such as generating synthetic genes, CRISPR systems, and entire synthetic genomes.
Computational biologists, bioinformaticians, and synthetic biology researchers who need to design, analyze, or generate DNA sequences for research, therapeutic development, or industrial applications.
Developers choose Evo for its combination of long-context capability, single-nucleotide (byte-level) resolution, and demonstrated use in generating functional biological sequences. It is open source, available through multiple interfaces (local, HuggingFace, API), and takes a foundation model approach, which makes it more versatile than traditional task-specific bioinformatics software.
Biological foundation modeling from molecular to genome scale
Uses the StripedHyena architecture, which scales near-linearly with sequence length, to handle contexts of up to 131,072 tokens and model genome-scale sequences efficiently.
Used to generate SynGenome, an AI-created database of over 100 billion base pairs of synthetic DNA, demonstrating practical application in de novo design.
Includes fine-tuned checkpoints for generating CRISPR-Cas systems and transposons, providing out-of-the-box capabilities for key synthetic biology workflows.
Available via local Python API, HuggingFace integration, and Together AI API, offering multiple interfaces for different research and deployment needs.
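To make the single-nucleotide tokenization and the 131,072-token context mentioned above concrete, here is a minimal sketch in plain Python. It is an illustration only and does not reproduce Evo's actual tokenizer or API: the function names `encode_dna`, `decode_dna`, and `split_into_windows` are hypothetical, and mapping each base to its ASCII byte value (after uppercasing) is an assumption made for the example.

```python
# Illustrative sketch only: NOT Evo's actual tokenizer or API.
# Assumption: one token per nucleotide, using the ASCII byte value of each base.

MAX_CONTEXT = 131_072  # context length quoted in this entry


def encode_dna(seq: str) -> list[int]:
    """Map each nucleotide character to its byte value (one token per base)."""
    return list(seq.upper().encode("ascii"))


def decode_dna(tokens: list[int]) -> str:
    """Invert encode_dna: turn byte-valued tokens back into a DNA string."""
    return bytes(tokens).decode("ascii")


def split_into_windows(tokens: list[int], max_len: int = MAX_CONTEXT) -> list[list[int]]:
    """Split tokens into consecutive, non-overlapping windows of at most max_len."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


if __name__ == "__main__":
    tokens = encode_dna("ACGTacgt")
    print(tokens)              # [65, 67, 71, 84, 65, 67, 71, 84]
    print(decode_dna(tokens))  # ACGTACGT
```

Because every base is its own token, a sequence longer than 131,072 nucleotides does not fit in a single context; non-overlapping windows are the simplest way to show that limit, though a real pipeline might prefer overlapping windows or a different strategy.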
Requires FlashAttention-2, which needs a supported GPU and a compatible PyTorch version, so installation can be difficult on unsupported systems.
Setup relies on conda environments and careful dependency management; conflicts involving the flash-attn library are a noted issue and increase deployment complexity.
Pre-trained only on the prokaryotic genomes in the OpenGenome dataset, so it may generalize poorly to eukaryotic sequences and other biological contexts outside that data.
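Given the installation caveats above, a quick pre-flight check can show whether the key dependencies are even visible to the current Python environment before attempting a full setup. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical, and the list of required modules (`torch` and `flash_attn`, the import names of PyTorch and flash-attn) is an assumption based on the requirements listed here, not an official checklist. It does not verify versions or GPU support.

```python
import importlib.util

# Assumed import names for the dependencies named in this entry.
REQUIRED = ["torch", "flash_attn"]


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Not installed in this environment:", ", ".join(missing))
    else:
        print("All listed packages are importable; version and GPU checks are still needed.")
```

A package being present does not guarantee compatibility, so this only rules out the most basic failure before moving on to version and hardware checks.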