Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.

Molecular Transformer

License: NOASSERTION · Language: Python

A sequence-to-sequence transformer model for predicting chemical reaction outcomes and retrosynthetic pathways, with uncertainty calibration.

426 stars · 83 forks · 0 contributors

What is Molecular Transformer?

Molecular Transformer is a sequence-to-sequence neural network model that predicts chemical reaction outcomes and retrosynthetic pathways. It represents molecules as SMILES strings and uses a transformer architecture to translate between reactants and products, helping chemists design synthesis routes faster. The model also estimates its own uncertainty, so each prediction comes with an indication of confidence.

Target Audience

Computational chemists, researchers in cheminformatics, and organic chemists who need AI tools for reaction prediction and retrosynthesis planning.

Value Proposition

Unlike proprietary tools, it provides an open-source, uncertainty-calibrated model trained on public reaction datasets. Its integration with RDKit for data preprocessing and the availability of pre-trained models lower the barrier to academic and industrial adoption.

Overview

Molecular Transformer is a neural machine translation model adapted for chemistry that predicts chemical reaction outcomes and retrosynthetic pathways. It translates between molecular representations (SMILES strings) to forecast how molecules react or how target molecules can be synthesized, accelerating discovery in organic chemistry and drug development.
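To make the translation framing concrete, here is a minimal sketch of how one reaction record could be turned into a source/target pair. It assumes the standard reaction SMILES layout `reactants>reagents>products`. The function name `reaction_to_pair` and the exact way the "mixed" and "separated" source formats are spelled are assumptions made for illustration, not the project's own preprocessing code.

```python
from typing import Tuple


def reaction_to_pair(rxn_smiles: str, separated: bool = False) -> Tuple[str, str]:
    """Split 'reactants>reagents>products' into a (source, target) pair.

    Hypothetical helper: in the mixed format reagents are merged into the
    reactant string with '.', in the separated format they are kept apart
    with '>'. Both spellings are assumptions of this sketch.
    """
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 'reactants>reagents>products', got {rxn_smiles!r}")
    reactants, reagents, products = parts
    if separated:
        source = f"{reactants}>{reagents}"
    else:
        # Join only the non-empty parts so an empty reagent slot adds no stray '.'
        source = ".".join(p for p in (reactants, reagents) if p)
    return source, products


# Esterification of acetic acid with ethanol, sulfuric acid as reagent
rxn = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC"
mixed_source, target = reaction_to_pair(rxn)
separated_source, _ = reaction_to_pair(rxn, separated=True)
```

For forward prediction the source is the input and the product string is the target; a retrosynthesis setup would swap the two.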

Key Features

  • Retrosynthesis Prediction — Predicts reactant molecules needed to synthesize a target product molecule.
  • Uncertainty Calibration — Provides confidence estimates for predictions, helping chemists assess reliability.
  • SMILES Tokenization — Uses custom tokenization of SMILES strings to treat molecules as sequences for transformer models.
  • Data Augmentation — Doubles training data by generating random equivalent SMILES representations via RDKit.
  • Pre-trained Models — Includes models trained on public datasets (USPTO_MIT, USPTO_STEREO) with mixed or separated reactant/reagent formats.
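The SMILES tokenization feature listed above can be sketched with a regular expression that splits a SMILES string into chemically meaningful tokens (bracket atoms, two-letter elements, ring closures, bonds). The pattern below is an assumption modeled on tokenizers commonly used for SMILES-based sequence models; it is not guaranteed to match the project's own pattern, and `tokenize_smiles` is a hypothetical name.

```python
import re

# Alternatives are ordered so that bracket atoms and two-letter elements
# (Br, Cl) are matched before single characters.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> str:
    """Return the SMILES string as space-separated tokens.

    Raises ValueError if some character is not covered by the pattern,
    so that malformed input is not silently truncated.
    """
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r}")
    return " ".join(tokens)


aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
```

The round-trip check (joining the tokens must reproduce the input) is a cheap guard: `re.findall` skips characters it cannot match, so without the check an unsupported symbol would simply disappear from the training data.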

Philosophy

Molecular Transformer aims to make AI-assisted chemical reaction prediction accessible to organic chemists, with the goal of integrating these models into daily laboratory workflows to accelerate molecular discovery.

Use Cases

Best For

  • Predicting reactants for a target molecule in retrosynthesis analysis
  • Estimating confidence scores for chemical reaction predictions
  • Academic research on AI-driven reaction prediction models
  • Data augmentation for chemical reaction datasets using SMILES randomization
  • Benchmarking new machine learning approaches against published USPTO dataset results
  • Integrating reaction prediction into automated synthesis planning pipelines

Not Ideal For

  • Teams requiring real-time, high-throughput reaction prediction in production pipelines
  • Chemists seeking drag-and-drop interfaces without coding or ML expertise
  • Projects focused on non-organic or novel reaction types outside USPTO patent data

Pros & Cons

Pros

Uncertainty Calibration

Provides confidence estimates alongside predictions so chemists can judge their reliability. The README highlights this explicitly, and it is uncommon among open-source reaction-prediction models.
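As an illustration of how a sequence model can attach a confidence score to a prediction, the sketch below multiplies the probabilities of the predicted tokens, computed in log space for numerical stability. That this product is the score used is an assumption of the sketch, as is the function name `sequence_confidence`; the per-token log-probabilities would come from the decoder.

```python
import math
from typing import Sequence


def sequence_confidence(token_log_probs: Sequence[float]) -> float:
    """Confidence of a predicted sequence as the product of token probabilities.

    `token_log_probs` holds the natural-log probability the model assigned
    to each token it emitted. Summing logs and exponentiating once avoids
    underflow on long sequences.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_log_probs))


# Two tokens, each predicted with probability 0.5: confidence is about 0.25
score = sequence_confidence([math.log(0.5), math.log(0.5)])
```

A score near 1 means the model was close to certain at every decoding step; a single uncertain token is enough to pull the score down, which is what makes it useful for flagging doubtful predictions.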

Pre-trained Models

Includes models trained on public datasets like USPTO_MIT and USPTO_STEREO, available for download, allowing immediate use without training from scratch.

Data Augmentation

Doubles training data by generating random equivalent SMILES via RDKit, as described in the README, improving model robustness and accuracy.
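A minimal sketch of this kind of augmentation follows. Each training pair is kept and a second copy is added whose source is a random but chemically equivalent SMILES string. `Chem.MolFromSmiles` and `Chem.MolToSmiles(..., canonical=False, doRandom=True)` are RDKit calls; the helper names, the choice to randomize only the source side, and the injectable `randomize` parameter are assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def rdkit_random_smiles(smiles: str) -> str:
    """Return a random, equivalent SMILES string (requires RDKit)."""
    from rdkit import Chem  # imported lazily so the rest works without RDKit

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse {smiles!r}")
    # doRandom=True picks a random atom ordering instead of the canonical one
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True)


def augment_pairs(
    pairs: Iterable[Tuple[str, str]],
    randomize: Callable[[str], str] = rdkit_random_smiles,
) -> List[Tuple[str, str]]:
    """Double a dataset: keep each (source, target) and add a randomized copy.

    Only the source side is randomized here, which is an assumption of
    this sketch; the target is left unchanged.
    """
    augmented = []
    for source, target in pairs:
        augmented.append((source, target))
        augmented.append((randomize(source), target))
    return augmented
```

Passing the randomizer as a parameter keeps the doubling logic testable without RDKit installed, for example `augment_pairs(data, randomize=lambda s: s)` in a unit test. Note that `rdkit_random_smiles` expects a plain (possibly multi-fragment) SMILES string, so a source containing `>` separators would need to be split first.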

RDKit Integration

Utilizes RDKit for SMILES canonicalization and tokenization, ensuring accurate molecular representation and preprocessing, which is critical for chemistry applications.

Cons

Outdated Dependencies

Requires Python 3.5 and PyTorch 0.4.1, which are obsolete and may cause compatibility issues with modern systems or libraries, as noted in the installation steps.

Complex Setup and Workflow

Setup involves a multi-step conda environment, data preprocessing, and averaging of the last 20 checkpoints, which makes the project hard to use without solid ML or chemistry expertise.
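The checkpoint-averaging step can be illustrated in a few lines. The sketch below averages parameters element-wise over the last N checkpoints, using plain Python lists as stand-ins for weight tensors. It is a simplified illustration under that assumption, not the project's averaging script; a real setup would load PyTorch checkpoints and average tensors instead.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a state dict: parameter name -> flat list of weights
Params = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[Params], last_n: int = 20) -> Params:
    """Element-wise mean of each parameter over the last `last_n` checkpoints."""
    if last_n < 1:
        raise ValueError("last_n must be at least 1")
    selected = list(checkpoints)[-last_n:]
    if not selected:
        raise ValueError("no checkpoints given")
    averaged: Params = {}
    for name in selected[0]:
        # zip(*...) walks the same weight position across all checkpoints
        columns = zip(*(ckpt[name] for ckpt in selected))
        averaged[name] = [sum(col) / len(selected) for col in columns]
    return averaged


checkpoints = [{"w": [0.0, 2.0]}, {"w": [2.0, 4.0]}, {"w": [4.0, 6.0]}]
mean_of_last_two = average_checkpoints(checkpoints, last_n=2)  # {"w": [3.0, 5.0]}
```

Averaging late checkpoints tends to smooth out step-to-step noise in the weights, which is why it is a common final step for transformer translation models.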

Limited Domain Generalization

The model is trained primarily on USPTO patent data, so predictions may falter for reactions outside this domain. The README acknowledges this when it mentions the need for more diverse data on IBM RXN.

Quick Stats

Stars: 426
Forks: 83
Contributors: 0
Open issues: 2
Last commit: 4 years ago
Created: 2018

Tags

#transformer-model #neural-machine-translation #chemical-informatics #rdkit #chemistry #smiles #machine-learning #pytorch

Built With

Conda
RDKit
Python
PyTorch

Included in

Computational Biology (122)

Related Projects

DiffDock

Implementation of DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking

Stars: 1,534 · Forks: 355 · Last commit: 1 year ago

JTVAE

Junction Tree Variational Autoencoder for Molecular Graph Generation (ICML 2018)

Stars: 561 · Forks: 196 · Last commit: 3 years ago

REINVENT

REINVENT is a reinforcement learning framework specifically designed for de novo drug design, enabling the generation of novel molecular structures with optimized properties. It addresses the challenge of discovering new chemical entities by combining generative models with property prediction to explore chemical space efficiently.

Key features:

  • Reinforcement Learning Pipeline: Uses RL to optimize molecular structures toward desired chemical properties and biological activities
  • De Novo Molecular Generation: Creates entirely new molecular entities rather than modifying existing compounds
  • Property Optimization: Incorporates scoring functions to guide generation toward molecules with specific target properties
  • Template-Based Execution: Provides configurable JSON templates for different running modes and experiments
  • TensorBoard Integration: Enables real-time monitoring and visualization of training logs and progress

Philosophy: REINVENT applies reinforcement learning principles to drug discovery, treating molecular generation as an optimization problem where the agent learns to propose molecules that maximize desired chemical and biological properties.

Stars: 375 · Forks: 114 · Last commit: 1 year ago

TargetDiff

The official implementation of 3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction (ICLR 2023)

Stars: 343 · Forks: 53 · Last commit: 2 years ago