A deep convolutional neural network that predicts RNA-seq coverage at 32bp resolution from DNA sequence.
Borzoi is a deep convolutional neural network that predicts RNA-seq coverage from DNA sequence. It takes 524kb input sequences and outputs predictions at 32bp resolution, modeling gene expression, splicing, and polyadenylation. It helps interpret how genetic variants affect RNA processing by providing in silico predictions that do not require new experiments.
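The input length and output resolution above imply a fixed bin count per sequence, which is easy to check by hand. The sketch below is illustrative only and is not Borzoi's code: it assumes "524kb" means 524,288 bp (2**19), it computes an upper bound that ignores any cropping the real model may apply at the sequence edges, and the `one_hot_encode` helper with its A/C/G/T channel order is a hypothetical convention commonly used for sequence models.

```python
# Assumption: "524kb" means 524,288 bp (2**19); the description only says 524kb.
INPUT_LENGTH_BP = 2 ** 19
BIN_SIZE_BP = 32

# Hypothetical channel order; real models may use a different convention.
ALPHABET = "ACGT"


def max_output_bins(input_length_bp: int = INPUT_LENGTH_BP,
                    bin_size_bp: int = BIN_SIZE_BP) -> int:
    """Upper bound on the number of coverage bins, ignoring any edge cropping."""
    if input_length_bp % bin_size_bp != 0:
        raise ValueError("input length must be a multiple of the bin size")
    return input_length_bp // bin_size_bp


def one_hot_encode(sequence: str) -> list[list[int]]:
    """Encode DNA as one row per base with four channels (A, C, G, T).

    Any other character (for example N) becomes an all-zero row.
    """
    rows = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = ALPHABET.find(base)
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(max_output_bins())        # bins per input under the stated assumptions
    print(one_hot_encode("ACGTN"))  # five rows, the last one all zeros
```

Under these assumptions a single input would cover at most 16,384 bins of 32bp each; the released models may report fewer if they trim the sequence edges.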
Borzoi is aimed at computational biologists and bioinformaticians working on variant interpretation, functional genomics, and gene expression modeling, as well as researchers who need to predict the effects of SNPs, indels, or structural variants on RNA-seq profiles.
Developers choose Borzoi for its high-resolution predictions, extensive training on diverse genomic assays, and ready-to-use models for human and mouse. Its open-source implementation and detailed tutorials for variant scoring make it a practical tool for in silico genomics experiments.
RNA-seq prediction with deep convolutional neural networks.
Predicts RNA-seq coverage at 32bp resolution from 524kb DNA sequences, enabling fine-grained analysis of gene expression and splicing.
Trained on a large collection of datasets from ENCODE, GTEx, CATlas, and FANTOM5 that covers diverse genomic assays, with detailed target lists provided.
Offers pre-trained models for human and mouse, including multiple replicates and mini-models for specific subsets like K562 RNA-seq, available for direct download.
Includes Jupyter notebooks and scripts for scoring and visualizing the impact of SNPs, indels, and structural variants, with example notebooks for eQTLs, sQTLs, and more.
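Variant scoring of the kind mentioned above generally compares predicted coverage for the reference and alternate alleles. The following is a minimal sketch of that idea, not the scoring used in Borzoi's notebooks: `apply_snp`, `score_variant`, and `toy_predict` are hypothetical names, the log2 fold change of summed coverage is just one possible statistic, and `toy_predict` is a stand-in for a real model's prediction call.

```python
import math
from typing import Callable, Sequence


def apply_snp(sequence: str, position: int, ref: str, alt: str) -> str:
    """Return the sequence with a single-base substitution at a 0-based position."""
    if sequence[position] != ref:
        raise ValueError(f"expected {ref!r} at position {position}, "
                         f"found {sequence[position]!r}")
    return sequence[:position] + alt + sequence[position + 1:]


def score_variant(predict: Callable[[str], Sequence[float]],
                  sequence: str, position: int, ref: str, alt: str,
                  pseudocount: float = 1.0) -> float:
    """Log2 fold change of total predicted coverage, alternate over reference."""
    ref_coverage = predict(sequence)
    alt_coverage = predict(apply_snp(sequence, position, ref, alt))
    return math.log2((sum(alt_coverage) + pseudocount)
                     / (sum(ref_coverage) + pseudocount))


def toy_predict(sequence: str, bin_size: int = 4) -> list[float]:
    """Stand-in model (hypothetical): per-bin coverage equals the G/C count."""
    return [float(sum(base in "GC" for base in sequence[start:start + bin_size]))
            for start in range(0, len(sequence), bin_size)]


if __name__ == "__main__":
    # A -> G at position 2 raises the toy coverage from 0 to 1.
    print(score_variant(toy_predict, "AAAAAAAA", 2, "A", "G"))
```

In practice the `predict` argument would wrap a trained model, and the statistic would usually be restricted to the bins and tracks relevant to the gene or tissue of interest.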
Requires installing three separate repositories (baskerville, borzoi, westminster) with specific Python 3.10 and TensorFlow 2.15.x versions, plus manual environment variable configuration.
Training data spans multiple terabytes and requires a billable GCP project to access, and some scripts depend on Slurm for multiprocessing, which limits accessibility for small labs.
Tutorials are described as 'minimal' and paper replication relies on an external repository, which may hinder users seeking in-depth guidance for custom analyses.
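Because the setup depends on pinned versions and manually configured environment variables, a small preflight check can catch mismatches before a long run. This is a hedged sketch rather than anything shipped with the project: the helper names are hypothetical, the variable name `EXAMPLE_TOOL_DIR` is a placeholder (the actual variable names should be taken from the installation instructions), and the TensorFlow version is passed in as a string so the check does not require importing TensorFlow.

```python
import os
import sys
from typing import Iterable, Mapping, Sequence


def python_version_ok(version_info: Sequence[int] = sys.version_info,
                      required: tuple = (3, 10)) -> bool:
    """True if the interpreter's major.minor version matches the required one."""
    return tuple(version_info[:2]) == required


def version_matches(version: str, required_prefix: str = "2.15.") -> bool:
    """True if a version string such as '2.15.1' falls in the 2.15.x series."""
    return version.startswith(required_prefix)


def missing_env_vars(names: Iterable[str],
                     environ: Mapping[str, str] = os.environ) -> list:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    # Placeholder variable name, not taken from the project's documentation.
    required_vars = ["EXAMPLE_TOOL_DIR"]
    print("Python 3.10:", python_version_ok())
    print("TensorFlow 2.15.x:", version_matches("2.15.1"))
    print("Missing variables:", missing_env_vars(required_vars))
```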