A long-range genomic foundation model that processes DNA sequences up to 1 million nucleotides at single nucleotide resolution.
HyenaDNA is a genomic foundation model built on the Hyena operator that processes DNA sequences of up to 1 million nucleotides at single-nucleotide resolution. This ultra-long context enables tasks such as classification, prediction, and in-context learning directly on DNA. The model is pretrained on the human reference genome (hg38) and can be fine-tuned for a range of downstream genomic applications.
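Single-nucleotide resolution means the model tokenizes DNA at the character level (one token per base) rather than grouping bases into k-mers. A minimal sketch of such a tokenizer, where the vocabulary and id assignments are illustrative assumptions rather than HyenaDNA's exact mapping:

```python
# Character-level DNA tokenizer sketch: one token per nucleotide.
# Vocabulary and id assignments are illustrative, not HyenaDNA's exact mapping.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # N = unknown/ambiguous base

def encode(seq: str) -> list[int]:
    """Map each nucleotide to an integer id, one token per base."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]

def decode(ids: list[int]) -> str:
    """Invert the mapping back to a nucleotide string."""
    inv = {i: b for b, i in VOCAB.items()}
    return "".join(inv[i] for i in ids)

tokens = encode("ACGT")
```

Because every base is its own token, a 1M-nucleotide input is a 1M-token sequence, which is what makes the model's long-context capability necessary.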
Genomics researchers, bioinformaticians, and machine learning practitioners working with DNA sequence data who need to model long-range dependencies and fine-grained nucleotide interactions.
HyenaDNA pairs extreme context length (up to 1M tokens) with single-nucleotide resolution, a combination that lets it perform strongly on long-range genomic tasks. Its open-source implementation and pretrained weights lower the barrier to applying deep learning in genomics.
Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena
Handles sequences up to 1 million tokens, enabling analysis of entire chromosomes or large genomic regions.
Processes DNA at the individual-base level, allowing fine-grained detection of genomic features.
Offers multiple pretrained model sizes on Hugging Face (all trained on hg38), with GPU requirements specified, so downstream tasks can start from pretrained weights instead of training from scratch.
Supports various downstream tasks like species classification and chromatin profiling, with example configs and dataloaders provided in the README.
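Downstream classification tasks typically require slicing a long genomic sequence into fixed-length, optionally overlapping windows before feeding it to the model. A sketch of that windowing step, where the window size, stride, and helper name are assumptions for illustration rather than the repo's dataloader API:

```python
def windows(seq: str, size: int, stride: int):
    """Yield (start, window) pairs of fixed-length windows over a DNA
    sequence; size and stride are task-dependent choices."""
    for start in range(0, len(seq) - size + 1, stride):
        yield start, seq[start:start + size]

seq = "ACGT" * 8  # 32-base toy sequence standing in for a genomic region
chunks = list(windows(seq, size=16, stride=8))
```

With a stride smaller than the window size, adjacent windows overlap, so features falling on a window boundary still appear whole in a neighboring window.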
Requires Docker or a manual install of dependencies such as Flash Attention, plus familiarity with PyTorch Lightning and Hydra, which makes onboarding challenging.
Large models need powerful GPUs (e.g., an A100 on the Colab paid tier for 1M-token sequences), and pretraining or fine-tuning can be computationally intensive.
The repo is self-described as a 'work in progress': users must dig into the code to write custom dataloaders, and experimental features such as the bidirectional implementation are not fully supported.
Assumes advanced ML knowledge: new datasets require custom configs and dataloaders, as noted in the README's sections on setting up downstream experiments.
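In the repo, custom datasets plug in as PyTorch dataloaders wired up through Hydra configs. The pure-Python sketch below only illustrates the map-style dataset contract (`__len__` / `__getitem__`) such a class must satisfy; in practice it would subclass `torch.utils.data.Dataset`, and the class name, vocabulary, and toy label scheme here are hypothetical:

```python
# Sketch of the map-style dataset contract (__len__ / __getitem__) a custom
# downstream dataset must satisfy. In the repo this would subclass
# torch.utils.data.Dataset; names and the toy labels are hypothetical.
BASE_IDS = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}

class DNAClassificationDataset:
    def __init__(self, records):
        # records: list of (sequence, label) pairs
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        seq, label = self.records[idx]
        # Character-level encoding: one integer id per nucleotide.
        ids = [BASE_IDS.get(b, BASE_IDS["N"]) for b in seq.upper()]
        return ids, label

ds = DNAClassificationDataset([("ACGT", 1), ("TTNA", 0)])
```

Once a class like this exists, a Hydra config entry pointing at it is what lets the existing training loop pick up the new dataset.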