An unsupervised machine learning approach to learn vector representations of molecular substructures for cheminformatics.
Mol2vec is an unsupervised machine learning approach that learns vector representations of molecular substructures, similar to how word2vec creates embeddings for words. It transforms molecules into sequences of substructure identifiers and trains a model to capture chemical relationships in a continuous vector space. This enables quantitative similarity analysis, featurization for predictive models, and exploration of chemical space without labeled data.
Cheminformatics researchers, computational chemists, and drug discovery scientists who need to represent molecules as numerical features for machine learning tasks.
Mol2vec provides chemically intuitive molecular embeddings using an unsupervised approach, eliminating the need for hand-crafted descriptors. It leverages natural language processing techniques to capture substructure relationships, offering a flexible and scalable way to featurize molecules for various predictive modeling applications.
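To make the word2vec analogy concrete, the following is a minimal sketch of how a "sentence" of substructure identifiers can be turned into (center, context) training pairs, as in the skip-gram variant of word2vec. The identifiers, the function name `skipgram_pairs`, and the window size are illustrative assumptions, not Mol2vec's own code; real word2vec implementations add sampling and other details omitted here.

```python
from typing import List, Tuple


def skipgram_pairs(sentence: List[str], window: int) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, sentence[j]))
    return pairs


# Hypothetical identifiers standing in for one molecule's substructure IDs.
mol_sentence = ["id_A", "id_B", "id_C", "id_D"]
pairs = skipgram_pairs(mol_sentence, window=1)
```

Substructures that tend to appear in similar contexts across many molecules end up with similar vectors, which is the sense in which the embeddings capture chemical relationships.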
Learns embeddings without labeled data by treating molecules as sentences of substructure identifiers, capturing nuanced chemical relationships as validated in the original paper.
Scales to large datasets: corpus generation for 20 million compounds takes 6 hours on 4 cores, making it suitable for large chemical libraries such as ZINC.
Offers command-line subcommands (corpus, train, and featurize) as well as a Python module, so it can be used in scripts or cheminformatics pipelines.
Generates vectors of a consistent dimension (e.g., 300D) for all molecules via featurization, directly usable as features in downstream machine learning models such as those in scikit-learn.
Requires RDKit and multiple Python libraries (NumPy, gensim, etc.), which can be difficult to install and maintain, especially on non-Linux systems or in constrained environments.
Corpus generation and model training are time-intensive, taking hours even with parallelization according to the project's performance notes, which limits rapid experimentation.
Relies on Morgan fingerprints with a fixed radius, which are derived from the 2D molecular graph and so may not capture 3D molecular conformations or electronic properties.
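The fixed-dimension featurization mentioned above can be sketched as follows. This is a toy illustration, assuming a molecule vector is formed by summing the vectors of its substructure identifiers and that unseen identifiers fall back to a shared `UNK` entry; the embedding table, the 4-dimensional vectors, and the `featurize` helper are hypothetical and stand in for a trained model.

```python
from typing import Dict, List

DIM = 4  # toy dimension for readability; the text mentions e.g. 300

# Hypothetical embedding table; in practice this comes from a trained model.
embeddings: Dict[str, List[float]] = {
    "id_A": [1.0, 0.0, 0.0, 0.0],
    "id_B": [0.0, 2.0, 0.0, 0.0],
    "UNK": [0.5, 0.5, 0.5, 0.5],
}


def featurize(sentence: List[str], table: Dict[str, List[float]],
              unknown: str = "UNK") -> List[float]:
    """Sum the vectors of all identifiers; unseen ones map to `unknown`."""
    vec = [0.0] * DIM
    for token in sentence:
        row = table.get(token, table[unknown])
        vec = [a + b for a, b in zip(vec, row)]
    return vec


small_mol = featurize(["id_A"], embeddings)
large_mol = featurize(["id_A", "id_B", "id_X"], embeddings)  # id_X is unseen
```

Both results have length `DIM` regardless of how many substructures each molecule contains, which is what allows them to be stacked into a feature matrix for a downstream model.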