How to fine-tune MolT5 on my own dataset?

Fine-tuning uses the T5X framework with the Gin configs provided in the configs/finetune directory. You must prepare your data in TFRecord format and define custom tasks modeled on the caption2smiles and smiles2caption examples in the README.
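Before any TFRecord conversion, a custom dataset has to be arranged as input/target pairs for each direction. The sketch below shows one way to do that with the standard library only. It assumes a tab-separated file with a header row containing `SMILES` and `description` columns; the column names, the `load_pairs` and `to_examples` helpers, and the sample row are illustrative assumptions, not part of the MolT5 repository.

```python
import csv
import io


def load_pairs(handle):
    """Read tab-separated rows and return (smiles, description) tuples.

    Assumes a header row with 'SMILES' and 'description' columns.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["SMILES"], row["description"]) for row in reader]


def to_examples(pairs, task):
    """Arrange pairs as inputs/targets for one translation direction."""
    if task == "smiles2caption":
        return [{"inputs": s, "targets": c} for s, c in pairs]
    if task == "caption2smiles":
        return [{"inputs": c, "targets": s} for s, c in pairs]
    raise ValueError(f"unknown task: {task}")


# Tiny in-memory stand-in for a real file (illustrative row only).
sample = io.StringIO(
    "CID\tSMILES\tdescription\n"
    "702\tCCO\tThe molecule is ethanol.\n"
)
pairs = load_pairs(sample)
print(to_examples(pairs, "caption2smiles"))
```

The same pairs feed both tasks, so one cleaned file is enough to build a smiles2caption and a caption2smiles split.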

MolT5 vs ChemBERTa for molecule tasks?

MolT5 is designed for translation between SMILES and natural language, enabling molecule generation and captioning, while ChemBERTa focuses on molecular property prediction and classification. Use MolT5 for cross-modal tasks and ChemBERTa for property prediction.

What datasets is MolT5 trained on?

MolT5 is pretrained on a mixture of ZINC (molecular data) and C4 (general text), and fine-tuned on ChEBI-20 for specific translation tasks. Datasets are provided in TFRecord and txt formats, as listed in the README.

How accurate is MolT5 for generating SMILES from text?

Accuracy depends on model size and fine-tuning. Even the large caption2smiles checkpoint may produce invalid SMILES, so generated strings need chemical validation. The README cites metrics like BLEU and ROUGE from the EMNLP paper for evaluation.
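As a rough illustration of a first-pass filter on generated SMILES, the sketch below checks only surface structure: balanced branch parentheses, closed bracket atoms, and paired ring-closure labels. The `looks_well_formed` helper is hypothetical. It is not chemical validation (it ignores valence and aromaticity), so a real pipeline would still parse each string with a cheminformatics toolkit; RDKit's `Chem.MolFromSmiles`, for example, returns `None` for input it cannot parse.

```python
DIGITS = "0123456789"


def looks_well_formed(smiles):
    """Cheap structural check: branches, bracket atoms, ring closures."""
    depth = 0           # currently open '(' branches
    open_rings = set()  # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip the whole bracket atom; digits inside are not ring labels.
            end = smiles.find("]", i)
            if end == -1:
                return False
            i = end + 1
            continue
        if ch == "]":
            return False  # closing bracket without an opening one
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "%":
            # Two-digit ring-closure label such as %12.
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or any(c not in DIGITS for c in label):
                return False
            open_rings ^= {str(int(label))}
            i += 3
            continue
        elif ch in DIGITS:
            open_rings ^= {ch}
        i += 1
    return bool(smiles) and depth == 0 and not open_rings


print(looks_well_formed("C1=CC=CC=C1"))  # ring opened and closed
print(looks_well_formed("C1=CC=CC=C"))   # ring label 1 never closed
```

A filter like this is useful mainly for counting obviously malformed generations cheaply before handing the rest to a proper parser.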

Can MolT5 handle stereochemistry in SMILES strings?

Yes. Because SMILES are processed as text, the model sees stereochemistry symbols like any other token. Performance on complex stereochemical descriptions is not explicitly validated in the README, so outputs should be checked with cheminformatics tools.
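One lightweight way to spot-check stereochemistry in model output is to count the stereo markers present in a SMILES string, for instance to flag a generation that has none when the caption describes a specific isomer. The sketch below is purely lexical and the `stereo_summary` helper is hypothetical; it does not verify that the markers are chemically meaningful.

```python
import re


def stereo_summary(smiles):
    """Count lexical stereo markers in a SMILES string."""
    return {
        # '@' or '@@' tags on chiral centers
        "chirality_tags": len(re.findall(r"@+", smiles)),
        # '/' and '\' marks that encode double-bond geometry
        "directional_bonds": smiles.count("/") + smiles.count("\\"),
    }


print(stereo_summary("C[C@H](N)C(=O)O"))
print(stereo_summary("F/C=C/F"))
```

Assigning and comparing actual configurations still requires a toolkit that understands the molecular graph.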

What hardware is needed to run MolT5-large?

For inference with the large model (roughly 800M parameters), a GPU with 16GB of VRAM is a comfortable target rather than a strict minimum; training demands considerably more memory for gradients and optimizer state. The T5X checkpoints run on JAX, while the HuggingFace checkpoints load through the Transformers library, typically with PyTorch.
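A back-of-the-envelope estimate of weight memory can help when sizing hardware. The sketch below multiplies the approximate parameter counts listed on this page by the bytes per parameter for common dtypes. It covers the weights only; activations, beam-search buffers, and (for training) gradients and optimizer state come on top. The `weight_memory_gb` helper is illustrative.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params, dtype="float32"):
    """Memory taken by the model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Approximate parameter counts for the three checkpoint sizes.
SIZES = {"small": 77e6, "base": 250e6, "large": 800e6}

for name, params in SIZES.items():
    fp32 = weight_memory_gb(params, "float32")
    fp16 = weight_memory_gb(params, "float16")
    print(f"{name}: {fp32:.2f} GB in float32, {fp16:.2f} GB in float16")
```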

MolT5

BSD-3-Clause · Python

A T5-based model for bidirectional translation between molecular structures (SMILES) and natural language descriptions.


What is MolT5?

MolT5 is a machine learning model that translates between molecular structures (represented as SMILES strings) and natural language descriptions. It bridges cheminformatics and natural language processing, enabling automated molecule captioning and text-based molecule generation. The model is based on the T5 architecture, pretrained on ZINC and C4, and fine-tuned on ChEBI-20.

Target Audience

Researchers and developers in computational chemistry, cheminformatics, and NLP who need tools for molecule description, generation, or cross-modal learning between chemical structures and text.

Value Proposition

MolT5 provides a unified, open-source framework for bidirectional molecule-language translation using state-of-the-art transformer models. It offers pretrained and fine-tuned checkpoints that are easily accessible via HuggingFace, reducing the barrier to applying advanced NLP techniques to chemical tasks.
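A minimal sketch of loading a checkpoint through the Transformers library is shown below. The Hub IDs under `laituan245/` are an assumption about how the checkpoints are named and should be confirmed against the README; `checkpoint_name` and `translate` are hypothetical helpers. The heavy imports sit inside `translate` so that the naming helper can be used without Transformers installed.

```python
def checkpoint_name(size, task=None):
    """Build a Hub ID such as 'laituan245/molt5-large-smiles2caption'.

    The naming pattern is an assumption; check it against the README.
    """
    if size not in {"small", "base", "large"}:
        raise ValueError(f"unknown size: {size}")
    if task not in {None, "smiles2caption", "caption2smiles"}:
        raise ValueError(f"unknown task: {task}")
    suffix = f"-{task}" if task else ""
    return f"laituan245/molt5-{size}{suffix}"


def translate(text, size="small", task="smiles2caption",
              num_beams=5, max_length=512):
    """Run one input through a fine-tuned checkpoint with beam search."""
    # Imported lazily: requires transformers, sentencepiece, and PyTorch.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    name = checkpoint_name(size, task)
    tokenizer = T5Tokenizer.from_pretrained(name, model_max_length=512)
    model = T5ForConditionalGeneration.from_pretrained(name)
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    outputs = model.generate(input_ids, num_beams=num_beams,
                             max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Example call (downloads weights on first use):
# print(translate("C1=CC2=C(C(=C1)[O-])NC(=CC2=O)C(=O)O"))
```

Passing `task="caption2smiles"` with a text description as input runs the opposite direction.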

Overview

Associated Repository for "Translation between Molecules and Natural Language"

Use Cases

Best For

Automatically generating descriptive captions for molecular structures


194 stars · 20 forks · 0 contributors

Creating molecular structures from textual descriptions in research papers

Building educational tools that explain chemistry concepts with molecule visualizations

Data augmentation for cheminformatics datasets by generating synthetic molecule-description pairs

Cross-modal retrieval between chemical databases and scientific literature

Prototyping applications that require interaction between chemical and natural language domains

Not Ideal For

Real-time applications requiring low-latency molecule translation due to large model inference times
Projects focused on 3D molecular properties or conformations, as SMILES encodes the molecular graph rather than 3D coordinates
Domains with specialized chemical vocabulary not in ChEBI-20, such as inorganic or polymer chemistry
Teams lacking GPU resources for training or inference with models up to 800M parameters

Pros & Cons

Pros

Bidirectional Translation Capability

Supports both molecule captioning (SMILES to text) and text-based generation (text to SMILES), as shown in the README's example code for smiles2caption and caption2smiles tasks.

HuggingFace Integration

Pretrained and fine-tuned checkpoints are available on HuggingFace, enabling easy use with the Transformers library without manual setup, as demonstrated in the usage examples.

Scalable Model Sizes

Offers small (~77M), base (~250M), and large (~800M) parameter checkpoints, allowing users to trade off performance for computational efficiency based on their needs.

T5 Architecture Foundation

Leverages the established T5 framework for sequence-to-sequence tasks, benefiting from proven NLP techniques and transfer learning capabilities.

Cons

Complex Pretraining Setup

Pretraining requires using the T5X framework with custom task definitions like zinc_span_corruption, involving additional configuration and data preprocessing, which is not straightforward for beginners.

Limited Documentation for Customization

The README provides basic usage but sparse guidance on modifying architectures or training on new datasets, relying on users to navigate T5X and seqio libraries independently.

High Computational Demands

Large models necessitate significant GPU memory and processing power, making them impractical for resource-constrained environments or rapid prototyping without access to high-end hardware.
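To make the pretraining objective behind a task like zinc_span_corruption (mentioned under Complex Pretraining Setup) more concrete, the sketch below applies T5-style span corruption to a SMILES string. It is a simplified illustration, not the repository's implementation: spans are passed in by hand and the string is split per character, whereas a real pipeline samples spans randomly over subword tokens. The `corrupt_spans` helper is hypothetical.

```python
def corrupt_spans(tokens, spans):
    """Apply T5-style span corruption.

    tokens: list of string tokens.
    spans:  non-overlapping (start, length) pairs, sorted by start.
    Returns (inputs, targets) as token lists.
    """
    inputs, targets = [], []
    cursor = 0
    for idx, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{idx}>"
        # Keep the text before the span, then mask the span in the input.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # The target lists each sentinel followed by the tokens it hides.
        targets.append(sentinel)
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets


# Aspirin, split per character purely for illustration.
tokens = list("CC(=O)OC1=CC=CC=C1C(=O)O")
inputs, targets = corrupt_spans(tokens, [(2, 4), (12, 3)])
print("".join(inputs))
print("".join(targets))
```

The model is trained to emit the target sequence given the corrupted input, which is how it learns SMILES structure from unlabeled molecules.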

Related Projects

Computational Biology

BioGPT

BioGPT is a generative pre-trained transformer model specifically designed for biomedical text generation and mining. It leverages large-scale biomedical literature to understand and generate domain-specific text, enabling advanced natural language processing applications in healthcare and life sciences.

Key Features

Biomedical Pre-training: Trained on PubMed abstracts and articles for domain-specific language understanding.
Text Generation: Generates coherent biomedical text, such as research summaries or hypothesis descriptions.
Relation Extraction: Identifies relationships between biomedical entities like drug-target interactions.
Question Answering: Answers biomedical questions based on contextual knowledge from literature.
Document Classification: Classifies biomedical documents into relevant categories.
Hugging Face Integration: Available through the transformers library for easy deployment and experimentation.

Philosophy

BioGPT focuses on bridging the gap between general-purpose language models and domain-specific needs by providing a model that understands the nuances and terminology of biomedical literature.

Stars: 4,489

Forks: 481

Last commit: 2 years ago

ClawBio

🦖 ClawBio - The first bioinformatics-native AI agent skill library. Local-first. Reproducible. Open. Free.

GeneGPT

Code and data for GeneGPT.

Stars: 428

Forks: 34

Last commit: 1 year ago

GenePT

GenePT is a foundation model for single-cell biology that leverages ChatGPT embeddings of NCBI gene descriptions to perform gene-level and cell-level tasks. It offers an efficient alternative to traditional models that require extensive data curation and resource-intensive training from gene expression profiles.

Key Features

Gene Embeddings: Uses GPT-3.5 embeddings of NCBI gene summary texts to represent genes.
Cell Embeddings: Generates single-cell embeddings by averaging gene embeddings weighted by expression or creating sentence embeddings from ordered gene names.
Efficient Approach: Eliminates the need for dataset curation and additional pre-training, making it user-friendly.
Competitive Performance: Achieves comparable or superior performance to existing single-cell foundation models in tasks like gene property classification and cell type annotation.
Pre-computed Data: Provides readily available datasets including extracted NCBI gene summaries and pre-computed OpenAI embeddings.

Philosophy

GenePT demonstrates that using large language model embeddings of scientific literature is a straightforward and effective approach for developing biological foundation models, complementing traditional expression-based methods.

Stars: 321

Forks: 47

Last commit: 2 years ago

#transformer

#cheminformatics

#natural-language-processing

#computational-chemistry