A T5-based model for bidirectional translation between molecular structures (SMILES) and natural language descriptions.
MolT5 is a machine learning model that translates between molecular structures (represented as SMILES strings) and natural language descriptions. It connects cheminformatics with natural language processing, enabling automated molecule captioning and text-based molecule generation. The model is based on the T5 architecture; it is pretrained on unlabeled data such as ZINC molecules and fine-tuned on the ChEBI-20 dataset.
MolT5 is aimed at researchers and developers in computational chemistry, cheminformatics, and NLP who need tools for molecule description, generation, or cross-modal learning between chemical structures and text.
MolT5 provides a unified, open-source framework for bidirectional molecule-language translation using state-of-the-art transformer models. It offers pretrained and fine-tuned checkpoints that are easily accessible via HuggingFace, reducing the barrier to applying advanced NLP techniques to chemical tasks.
Associated Repository for "Translation between Molecules and Natural Language"
Supports both molecule captioning (SMILES to text) and text-based molecule generation (text to SMILES), as shown in the README's example code for the smiles2caption and caption2smiles tasks.
Pretrained and fine-tuned checkpoints are available on HuggingFace, enabling easy use with the Transformers library without manual setup, as demonstrated in the usage examples.
Offers small (~77M), base (~250M), and large (~800M) parameter checkpoints, allowing users to trade off performance for computational efficiency based on their needs.
Leverages the established T5 framework for sequence-to-sequence tasks, benefiting from proven NLP techniques and transfer learning capabilities.
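The HuggingFace usage described above can be sketched as follows. This is a minimal sketch, not the repository's own code: it assumes the fine-tuned checkpoints are published under Hub IDs of the form laituan245/molt5-<size>-<task>, and the helper names checkpoint_id and translate are hypothetical. Confirm the exact IDs on HuggingFace before relying on them. It also assumes that transformers, sentencepiece, and PyTorch are installed.

```python
# Assumed naming scheme for the fine-tuned checkpoints on the HuggingFace Hub.
# Verify these IDs on the Hub; they are not stated in the text above.
HUB_USER = "laituan245"
SIZES = ("small", "base", "large")
TASKS = ("smiles2caption", "caption2smiles")


def checkpoint_id(size: str, task: str) -> str:
    """Build a Hub model ID such as 'laituan245/molt5-small-smiles2caption'."""
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    return f"{HUB_USER}/molt5-{size}-{task}"


def translate(text: str, size: str = "small", task: str = "smiles2caption",
              num_beams: int = 5, max_length: int = 512) -> str:
    """Run one input through a fine-tuned checkpoint.

    For smiles2caption, `text` is a SMILES string; for caption2smiles,
    it is a natural language description. Weights are downloaded on first use.
    """
    # Imported lazily so checkpoint_id() works without transformers installed.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    name = checkpoint_id(size, task)
    tokenizer = T5Tokenizer.from_pretrained(name, model_max_length=512)
    model = T5ForConditionalGeneration.from_pretrained(name)

    input_ids = tokenizer(text, return_tensors="pt").input_ids
    outputs = model.generate(input_ids, num_beams=num_beams, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Calling translate("CCO", task="smiles2caption") would return a generated caption for ethanol, and translate(description, task="caption2smiles") would return a SMILES string. Starting with the small checkpoint keeps the download and inference cost low while the pipeline is being tested.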
Pretraining requires the T5X framework with custom task definitions such as zinc_span_corruption, plus additional configuration and data preprocessing, which makes setup difficult for beginners.
The README covers basic usage but offers little guidance on modifying architectures or training on new datasets, leaving users to navigate the T5X and seqio libraries on their own.
Large models necessitate significant GPU memory and processing power, making them impractical for resource-constrained environments or rapid prototyping without access to high-end hardware.
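To put the hardware concern in rough numbers, the following sketch estimates the memory needed just to hold the weights of each checkpoint, using the approximate parameter counts quoted earlier. It is a back-of-the-envelope calculation under stated assumptions: 4 bytes per parameter in float32 and 2 bytes in float16. It ignores activations, beam search state, and, for training, gradients and optimizer state, all of which add substantially to the total. The function name weight_memory_gib is hypothetical.

```python
# Approximate parameter counts as quoted in the text (not exact figures).
PARAM_COUNTS = {"small": 77_000_000, "base": 250_000_000, "large": 800_000_000}

# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(size: str, dtype: str = "float32") -> float:
    """Estimate the memory (in GiB) occupied by the model weights alone."""
    return PARAM_COUNTS[size] * BYTES_PER_PARAM[dtype] / 2**30


for size in PARAM_COUNTS:
    fp32 = weight_memory_gib(size, "float32")
    fp16 = weight_memory_gib(size, "float16")
    print(f"{size:>5}: {fp32:.2f} GiB (float32), {fp16:.2f} GiB (float16)")
```

By this estimate the large checkpoint needs roughly 3 GiB for float32 weights alone, about ten times the small checkpoint, which is why the smaller variants are the practical choice for prototyping.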