A bi-directional, reverse-complement equivariant sequence model built on Mamba for long-range DNA modeling, enabling strand-aware genomic analysis.
Caduceus is a bi-directional, reverse-complement equivariant DNA language model built on the Mamba state space architecture rather than on attention. It processes genomic sequences of up to 131k tokens while maintaining reverse-complement equivariance, so predictions stay consistent regardless of which DNA strand is read. The model addresses the need for scalable, biologically accurate representations in genomics.
Caduceus is aimed at bioinformatics researchers and computational biologists who develop deep learning models for genomic sequence analysis, variant effect prediction, and regulatory element classification.
Developers choose Caduceus because it combines bi-directional context, long-range capability, and built-in reverse-complement equivariance, which yields more robust and biologically plausible representations than DNA language models that lack these properties.
Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Processes DNA sequences in both the forward and reverse directions, so each position draws on nucleotide context from both sides. How the two directions are combined is configurable through the bidirectional_strategy setting in the training scripts.
Ensures consistent predictions regardless of DNA strand orientation. The Caduceus-PS variant is designed to achieve this without data augmentation, which is critical for biological accuracy in tasks like variant effect prediction.
Handles sequences of up to 131k tokens, enabling analysis of extensive genomic regions, as demonstrated by pre-training on the human reference genome with the corresponding max_length configurations.
Offers the Caduceus-PS and Caduceus-Ph variants on Hugging Face for masked language modeling, so they can be integrated into existing pipelines without training from scratch.
Provides configurable training pipelines for downstream tasks like GenomicBenchmarks and Nucleotide Transformer datasets, with options for conjoined training and testing to leverage equivariance.
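To make the strand-consistency idea concrete, the sketch below shows reverse complementation and a conjoined prediction that averages a scorer over both strands. It is a toy illustration in plain Python, not the Caduceus architecture or its API: toy_score and conjoined_score are hypothetical names, and the scorer is a deliberately strand-sensitive stand-in for a real model.

```python
# Toy illustration only: this is not Caduceus code or its API.
# Assumes uppercase DNA strings over the alphabet A, C, G, T, N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return seq.translate(COMPLEMENT)[::-1]


def toy_score(seq: str) -> float:
    """Hypothetical strand-sensitive scorer: position-weighted count of 'A'."""
    return sum(i + 1 for i, base in enumerate(seq) if base == "A") / len(seq)


def conjoined_score(seq: str) -> float:
    """Average the scorer over both strands so the result is strand-invariant."""
    return 0.5 * (toy_score(seq) + toy_score(reverse_complement(seq)))


seq = "AACGTTGCA"
rc = reverse_complement(seq)

# The raw scorer depends on which strand it is given ...
print(toy_score(seq) == toy_score(rc))              # False
# ... while the conjoined score is identical for both strands.
print(conjoined_score(seq) == conjoined_score(rc))  # True
```

Averaging over both strands at prediction time is the simplest route to strand-invariant outputs for any model. A variant with equivariance built into the architecture, as described for Caduceus-PS, aims for the same guarantee without relying on reverse-complement data augmentation.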
Requires setting up a conda environment, manually downloading data from specific sources (e.g., hg38.ml.fa), and familiarity with Slurm for cluster jobs, which makes it less accessible for quick experimentation.
Primarily targets genomic DNA sequences, with no mention of support for RNA or protein data. It also relies on the HyenaDNA framework, which may limit community extensions or tooling.
Processing sequences of up to 131k tokens demands significant GPU memory and compute, as the batch size adjustments and distributed training scripts indicate, which can be prohibitive for smaller teams.
The README focuses on reproducing the paper's results with detailed Slurm scripts and config overrides, but it lacks beginner-friendly tutorials and simplified deployment guides for new users.