How do I install GuacaMol with RDKit on Windows?

GuacaMol itself installs with pip, but RDKit often has to be installed separately, through Conda or a manual build. To avoid compatibility issues, use the Docker container, as the README recommends.
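Before running any benchmark, it can help to confirm that both packages are importable in the active environment. The following is a minimal sketch using only the Python standard library; the helper name check_environment is hypothetical and not part of GuacaMol.

```python
from importlib.util import find_spec


def check_environment(packages=("rdkit", "guacamol")):
    """Return {import_name: bool} telling whether each top-level package can be found.

    find_spec only locates the package; it does not import it, so this check
    is cheap and does not fail when a package is broken at import time.
    """
    return {name: find_spec(name) is not None for name in packages}


# Print a short report for the two packages the benchmarks depend on.
for name, available in check_environment().items():
    print(f"{name}: {'found' if available else 'missing'}")
```

If either package is reported as missing, installing it (or switching to the Docker image) before going further avoids harder-to-read import errors later.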

How does GuacaMol compare with MOSES for molecular generation benchmarks?

GuacaMol covers both distribution-learning and goal-directed tasks on standardized ChEMBL-derived datasets, whereas MOSES concentrates on distribution-learning metrics over a ZINC-derived dataset. GuacaMol also maintains a public leaderboard for community-wide comparison.

Can I use my own dataset with GuacaMol benchmarks?

Yes, provided it is pre-processed the same way as the ChEMBL datasets, including filtering out molecules that contain forbidden symbols. The data generation script can be customized, though results may be harder to reproduce outside the Docker environment.
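As an illustration of the forbidden-symbol step, here is a minimal pure-Python sketch. The symbol list below is a hypothetical placeholder, not GuacaMol's actual list, and the substring check is deliberately naive; GuacaMol's own data generation script remains the reference for real pre-processing.

```python
# Hypothetical subset of symbols to reject. GuacaMol's real list is defined
# in its data-generation code and is not reproduced here.
FORBIDDEN_SYMBOLS = ("Ag", "Al", "Fe", "Zr")


def has_forbidden_symbol(smiles, forbidden=FORBIDDEN_SYMBOLS):
    """Naive check: True if any forbidden symbol occurs as a substring."""
    return any(symbol in smiles for symbol in forbidden)


def filter_smiles(smiles_list, forbidden=FORBIDDEN_SYMBOLS):
    """Drop blank entries, entries with forbidden symbols, and exact duplicates.

    Input order is preserved. No chemistry-aware canonicalization is done,
    so two different SMILES strings for the same molecule are both kept.
    """
    kept = []
    seen = set()
    for smiles in smiles_list:
        smiles = smiles.strip()
        if not smiles or has_forbidden_symbol(smiles, forbidden):
            continue
        if smiles in seen:
            continue
        seen.add(smiles)
        kept.append(smiles)
    return kept


print(filter_smiles(["CCO", "[Fe+2]", "CCO", " c1ccccc1 ", ""]))
```

A real pipeline would also canonicalize each molecule with RDKit before deduplicating, which this sketch leaves out to stay dependency-free.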

How do I run goal-directed benchmarks for a custom scoring function?

Subclass GoalDirectedGenerator and implement the generate_optimized_molecules method, which receives the scoring function to optimize. Then call assess_goal_directed_generation with your model instance, as detailed in the benchmarking models section of the README.
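The subclassing step might look like the sketch below. The toy model BestOfPoolGenerator and the helper run_goal_directed_benchmark are hypothetical names introduced for illustration; the model simply ranks a fixed pool of candidates and is not a real optimizer. The fallback base class exists only so the sketch runs without GuacaMol installed, and the output file name is an arbitrary choice.

```python
from typing import List, Optional

try:
    from guacamol.goal_directed_generator import GoalDirectedGenerator
except ImportError:
    # Stand-in base class so this sketch can run without GuacaMol installed.
    class GoalDirectedGenerator:  # type: ignore[no-redef]
        pass


class BestOfPoolGenerator(GoalDirectedGenerator):
    """Toy model: rank a fixed candidate pool with the given scoring function."""

    def __init__(self, candidate_smiles: List[str]) -> None:
        self.candidate_smiles = list(candidate_smiles)

    def generate_optimized_molecules(
        self,
        scoring_function,
        number_molecules: int,
        starting_population: Optional[List[str]] = None,
    ) -> List[str]:
        # Merge the fixed pool with any starting population supplied by the benchmark.
        pool = self.candidate_smiles + list(starting_population or [])
        scores = scoring_function.score_list(pool)
        # Highest score first; keep only the requested number of molecules.
        ranked = sorted(zip(scores, pool), key=lambda pair: pair[0], reverse=True)
        return [smiles for _, smiles in ranked[:number_molecules]]


def run_goal_directed_benchmark(generator, output_path="goal_directed_results.json"):
    """Run the goal-directed suite; requires GuacaMol and can take a long time."""
    from guacamol.assess_goal_directed_generation import assess_goal_directed_generation

    assess_goal_directed_generation(generator, json_output_file=output_path)
```

With GuacaMol installed, calling run_goal_directed_benchmark(BestOfPoolGenerator(pool)) would hand each benchmark's scoring function to the model in turn.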

What are the baseline scores for distribution-learning tasks?

Baseline scores are reported in the linked paper and in the baseline implementations repository; the leaderboard at benevolent.com/guacamol lists the current top-performing models for reference.
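To give a feel for what a very simple distribution-learning baseline involves, here is a sketch of a model that only resamples its training set. The class RandomTrainingSetSampler and the helper run_distribution_benchmark are hypothetical names, the output file name is arbitrary, and the fallback base class is a stand-in so the code runs without GuacaMol installed. It is not a copy of any model in the baselines repository.

```python
import random
from typing import List

try:
    from guacamol.distribution_matching_generator import DistributionMatchingGenerator
except ImportError:
    # Stand-in base class so this sketch can run without GuacaMol installed.
    class DistributionMatchingGenerator:  # type: ignore[no-redef]
        pass


class RandomTrainingSetSampler(DistributionMatchingGenerator):
    """Toy baseline: draw molecules uniformly, with replacement, from the training set."""

    def __init__(self, training_smiles: List[str], seed: int = 0) -> None:
        self.training_smiles = list(training_smiles)
        self.rng = random.Random(seed)  # private RNG keeps sampling repeatable

    def generate(self, number_samples: int) -> List[str]:
        return self.rng.choices(self.training_smiles, k=number_samples)


def run_distribution_benchmark(generator, training_file,
                               output_path="distribution_learning_results.json"):
    """Run the distribution-learning suite; requires GuacaMol and its dependencies."""
    from guacamol.assess_distribution_learning import assess_distribution_learning

    assess_distribution_learning(
        generator,
        chembl_training_file=training_file,
        json_output_file=output_path,
    )
```

Because this sampler only ever returns training molecules, it would be expected to do poorly on any novelty measure, which is one reason such trivial models are useful as a lower reference point.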

Is GuacaMol suitable for large-scale industrial drug discovery?

Yes, it's designed for rigorous evaluation in drug discovery, but integration with proprietary pipelines may require additional work due to its Python-centric and Docker-dependent nature.


GuacaMol

MIT · Python

A Python package for benchmarking generative models in de novo molecular design.


525 stars · 99 forks · 0 contributors

What is GuacaMol?

GuacaMol is a Python package that provides benchmarks for evaluating generative models in de novo molecular design. It addresses inconsistent evaluation in computational chemistry by offering standardized tests that measure how well models generate novel, drug-like molecules. The package includes both distribution-learning and goal-directed benchmarks, which assess different aspects of generative performance.
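Two of the ideas behind distribution-learning evaluation, uniqueness and novelty, can be illustrated with plain set arithmetic. The sketch below is a simplified illustration and not GuacaMol's implementation: it assumes all SMILES strings are already valid and canonical, and the function names are chosen here for clarity.

```python
def uniqueness(samples):
    """Fraction of generated molecules that are distinct from one another."""
    if not samples:
        return 0.0
    return len(set(samples)) / len(samples)


def novelty(samples, training_set):
    """Fraction of distinct generated molecules that are absent from the training set."""
    unique_samples = set(samples)
    if not unique_samples:
        return 0.0
    return len(unique_samples - set(training_set)) / len(unique_samples)


generated = ["CCO", "CCO", "CCN", "CCC"]
print(uniqueness(generated))        # 3 distinct molecules out of 4 samples
print(novelty(generated, ["CCO"]))  # 2 of the 3 distinct molecules are new
```

A model that memorizes its training data can score well on uniqueness and still score zero on novelty, which is why the two are measured separately.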

Target Audience

Computational chemists, machine learning researchers, and pharmaceutical scientists developing or using generative models for molecular design and drug discovery.

Value Proposition

Researchers choose GuacaMol because it provides rigorous, reproducible benchmarks that enable fair comparison of generative models, includes standardized datasets, and offers containerized environments for consistent evaluation across different systems.

Overview

Benchmarks for generative chemistry

Use Cases

Best For

Comparing different generative models for molecular design
Evaluating new algorithms for de novo drug discovery
Establishing performance baselines for generative chemistry research
Reproducible benchmarking in computational chemistry
Testing models against standardized chemical property optimization tasks
Academic research requiring transparent evaluation metrics

Not Ideal For

Projects requiring real-time molecular generation for interactive applications
Teams heavily invested in non-Python cheminformatics or machine learning frameworks
Small-scale academic projects lacking resources for Docker-based reproducible setups
Applications focused solely on molecular property prediction without generation tasks

Pros & Cons

Pros

Standardized Benchmarking Framework

Provides both distribution-learning and goal-directed benchmarks, enabling fair comparison of generative models as outlined in the benchmarking models section and the accompanying paper.

Reproducible Data and Environment

Includes pre-processed ChEMBL datasets and Docker support for consistent benchmarking, with detailed instructions in the Data and Docker sections to ensure reproducibility.

Baseline Model Implementations

Reference implementations of common generative models are available in the separate guacamol_baselines repository, which helps establish performance baselines.

Community-Driven Leaderboard

Features a public leaderboard for transparent comparison of model performances, encouraging community engagement and progress tracking as highlighted in the README.

Cons

Complex Dependency Management

Requires specific versions of RDKit and FCD libraries, with pinned dependencies that can lead to installation conflicts, as noted in the installation section and change log updates.

Setup Overhead for Reproducibility

Docker is recommended for data generation and benchmarking, which adds complexity for users unfamiliar with containerization; the README's Docker commands and reproducibility warnings reflect this.

Limited Dataset Flexibility

The benchmarks rely primarily on ChEMBL datasets, which may not cover all chemical spaces. Integrating a custom dataset requires handling forbidden symbols, as mentioned in the data generation section and the change log.


Related Projects

MOSES

Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models

Stars: 979 · Forks: 280 · Last commit: 2 years ago

TAPE (Tasks Assessing Protein Embeddings)

Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology.

Stars: 740 · Forks: 135 · Last commit: 3 years ago

ProteinGym

Official repository for the ProteinGym benchmarks

Stars: 442 · Forks: 58 · Last commit: 3 months ago

scIB (Single-cell Integration Benchmarks)

Benchmarking analysis of data integration tools

Stars: 423 · Forks: 76 · Last commit: 2 months ago
