Question 1

How do I fine-tune Caduceus on my own genomic dataset?

Accepted Answer

Adapt the provided training scripts: point the dataset configuration in train.py at your data path and adjust parameters such as batch_size and seq_len. The README shows examples for GenomicBenchmarks; for your own data you will need to preprocess it into the format those examples expect, which will likely mean writing custom code.
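As a rough illustration of the preprocessing step, the sketch below cuts raw sequences into fixed-length windows that match a chosen seq_len. The helper names (clean_sequence, window_sequence) and the choice of "N" for padding and unknown bases are assumptions made for this example, not something taken from the Caduceus repository; check the README examples for the format the training scripts actually expect.

```python
from typing import List, Optional

PAD_BASE = "N"  # assumption: "N" stands in for padding and unknown bases


def clean_sequence(seq: str) -> str:
    """Uppercase the sequence and replace anything outside A/C/G/T with N."""
    return "".join(b if b in "ACGT" else PAD_BASE for b in seq.upper())


def window_sequence(seq: str, seq_len: int, stride: Optional[int] = None) -> List[str]:
    """Split a sequence into windows of exactly seq_len bases.

    The last window is right-padded with PAD_BASE. A stride smaller than
    seq_len produces overlapping windows.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    step = stride if stride is not None else seq_len
    if step <= 0:
        raise ValueError("stride must be positive")
    cleaned = clean_sequence(seq)
    windows = []
    for start in range(0, len(cleaned), step):
        chunk = cleaned[start:start + seq_len]
        windows.append(chunk.ljust(seq_len, PAD_BASE))
    return windows


if __name__ == "__main__":
    for w in window_sequence("acgtnnxacg", seq_len=4):
        print(w)
```

The windows (plus any labels) would then be written out in whatever file layout the dataset loader reads.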

Question 2

Caduceus vs HyenaDNA: which should I use for DNA sequence modeling?

Accepted Answer

Both are long-range DNA sequence models, but Caduceus is built on Mamba rather than on HyenaDNA's Hyena operator, and it adds bi-directionality and reverse-complement (RC) equivariance. That makes it the better fit for tasks where both strands of a sequence should be treated consistently. HyenaDNA may be sufficient for simpler tasks where strand symmetry does not matter, but Caduceus offers more robust representations for genomics, as noted in the comparisons in its paper.
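To make "reverse-complement equivariance" concrete, here is a toy check, not code from either model. A per-position function is RC equivariant if applying it to the reverse complement gives the reversed output. The functions gc_track and a_track are made up for illustration. In a real model the output is a multi-channel hidden state, so the comparison would also involve the channel dimension, which this sketch ignores.

```python
from typing import Callable, List

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_track(seq: str) -> List[int]:
    """Toy per-position output: 1 where the base is G or C."""
    return [1 if b in "GC" else 0 for b in seq]


def a_track(seq: str) -> List[int]:
    """Toy per-position output: 1 where the base is A (strand dependent)."""
    return [1 if b == "A" else 0 for b in seq]


def is_rc_equivariant(fn: Callable[[str], List[int]], seq: str) -> bool:
    """Check fn(RC(seq)) == reverse(fn(seq)) for one sequence."""
    return fn(reverse_complement(seq)) == fn(seq)[::-1]


if __name__ == "__main__":
    seq = "AACG"
    print(reverse_complement(seq))
    print(is_rc_equivariant(gc_track, seq))
    print(is_rc_equivariant(a_track, seq))
```

gc_track passes because G and C swap under complementation, so the GC pattern is simply reversed; a_track fails because A becomes T on the other strand.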

Question 3

What GPU memory is needed to run Caduceus on full 131k token sequences?

Accepted Answer

Running Caduceus on 131k-token sequences requires substantial GPU memory: likely multiple high-end GPUs with tens of GB each, judging by the scripts' use of distributed training (torchrun) and their batch-size adjustments. On limited hardware, reduce the sequence length or the batch size.
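For a feel of how memory scales, the back-of-the-envelope sketch below counts only one saved activation tensor of shape (batch, seq_len, d_model) per layer. It is a crude lower bound, not a measurement: it ignores weights, gradients, optimizer state, and whatever the state-space layers keep internally. The values d_model=256 and n_layer=16 are assumptions for the 131k model, and tensors_per_layer is a hypothetical knob for "how many such tensors each layer keeps for the backward pass".

```python
def activation_bytes(
    batch_size: int,
    seq_len: int,
    d_model: int,
    n_layer: int,
    bytes_per_value: int = 2,    # assumption: 16-bit activations
    tensors_per_layer: int = 1,  # hypothetical multiplier, tune to taste
) -> int:
    """Bytes needed to keep tensors_per_layer activations per layer."""
    return (
        batch_size * seq_len * d_model * n_layer
        * bytes_per_value * tensors_per_layer
    )


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    full = activation_bytes(batch_size=1, seq_len=131072, d_model=256, n_layer=16)
    half = activation_bytes(batch_size=1, seq_len=65536, d_model=256, n_layer=16)
    print(f"seq_len=131072: {to_gib(full):.2f} GiB per saved tensor per layer")
    print(f"seq_len=65536:  {to_gib(half):.2f} GiB per saved tensor per layer")
```

The estimate is linear in both batch size and sequence length, which is why halving either one is the first thing to try on smaller GPUs.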

Question 4

Can Caduceus handle RNA sequences or only DNA?

Accepted Answer

Caduceus is designed for DNA: its tokenization and pre-training cover the nucleotide bases A, C, G, and T. The README does not mention RNA support, so adapting it would require retraining on RNA data and possibly modifying the tokenizer (for example, to handle U).

Question 5

How do I extract embeddings from Caduceus for downstream machine learning models?

Accepted Answer

Use the provided vep_embeddings.py script, which extracts embeddings and has options for reverse-complement handling. The README explains how to launch it with torchrun for parallelism; you specify the model path and sequence length to fit your task.
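Most downstream models (logistic regression, gradient boosting, and so on) want one fixed-size vector per sequence, while a sequence model produces one vector per token. A common way to bridge the two is masked mean pooling, sketched below in plain Python. The input layout (a list of per-token vectors plus a 0/1 mask marking real tokens) is an assumption for illustration and is not the output format of vep_embeddings.py; with real tensors you would do the same thing in your array library of choice.

```python
from typing import List, Sequence


def mean_pool(hidden_states: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average per-token vectors, skipping positions where mask is 0.

    hidden_states: one vector per token, all of the same length.
    mask: 1 for real tokens, 0 for padding.
    """
    if len(hidden_states) != len(mask):
        raise ValueError("hidden_states and mask must have the same length")
    kept = [vec for vec, m in zip(hidden_states, mask) if m]
    if not kept:
        raise ValueError("mask removes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


if __name__ == "__main__":
    # Three tokens with 2-dimensional vectors; the last token is padding.
    states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    print(mean_pool(states, mask=[1, 1, 0]))
```

The pooled vectors can be stacked into a feature matrix and passed to any standard classifier or regressor.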

Question 6

What's the practical difference between Caduceus-PS and Caduceus-Ph?

Accepted Answer

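Caduceus-PS is reverse-complement (RC) equivariant by design, through parameter sharing, so it needs no RC data augmentation. Caduceus-Ph is not equivariant by construction: it is trained with RC augmentation and relies on post-hoc conjugation at inference, where predictions for a sequence and its reverse complement are combined. PS guarantees consistent behavior across strands, while Ph may offer more flexibility in some setups, as noted in the Hugging Face model descriptions.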
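One common way to get strand-consistent predictions from a model that is not equivariant by construction is to average its output over a sequence and its reverse complement (often called post-hoc conjugation). The sketch below shows the idea for a scalar prediction. predict_fn is a hypothetical stand-in for any model call, and toy_score is a deliberately strand-dependent toy; neither comes from the Caduceus code base.

```python
from typing import Callable

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def post_hoc_conjugate(predict_fn: Callable[[str], float]) -> Callable[[str], float]:
    """Wrap predict_fn so its output is the mean over both strands."""
    def wrapped(seq: str) -> float:
        forward = predict_fn(seq)
        backward = predict_fn(reverse_complement(seq))
        return 0.5 * (forward + backward)
    return wrapped


def toy_score(seq: str) -> float:
    """Toy strand-dependent 'model': the fraction of A bases."""
    return seq.count("A") / max(len(seq), 1)


if __name__ == "__main__":
    seq = "AACG"
    rc = reverse_complement(seq)
    symmetric = post_hoc_conjugate(toy_score)
    print(toy_score(seq), toy_score(rc))   # differ between strands
    print(symmetric(seq), symmetric(rc))   # identical after averaging
```

The wrapper costs two forward passes per sequence, which is the practical price of getting strand consistency after the fact rather than from the architecture.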
