A Python library for fast random access to chemical descriptors and molecule indices, optimized for machine learning workflows.
DescriptaStorus is a Python library for computing, storing, and accessing chemical descriptors, optimized for machine learning applications in cheminformatics. It provides fast random access to precomputed molecular properties and indexed molecule files, streamlining the preparation of chemical datasets for model training and evaluation.
It is aimed at cheminformatics researchers, computational chemists, and data scientists working on molecular machine learning projects who need efficient descriptor management and reproducible workflows.
Developers choose DescriptaStorus for its performance-optimized storage and retrieval of chemical descriptors, its integration with RDKit, and its focus on reproducibility—enabling consistent descriptor generation across different environments.
Descriptor computation (chemistry) and (optional) storage for machine learning
Enables quick random-access retrieval of descriptor rows and indexed molecule data, in line with the README's emphasis on efficiency for machine learning pipelines.
Supports multiple RDKit-based generators like RDKit2D and Morgan counts, which can be combined for tailored descriptor sets, as shown in the MakeGenerator examples.
Includes validation mechanisms to ensure descriptor consistency across different hardware and software environments, addressing key reproducibility challenges in cheminformatics.
Provides low-level raw store tools to build custom descriptor stores with user-defined columns and data types, enabling bespoke data management.
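The random-access and raw-store points above can be illustrated with a small conceptual sketch. It is not DescriptaStorus's API or its on-disk format; it only shows why a store with user-defined, fixed-width columns allows any row to be read without scanning the rows before it. The column names, the `write_store` and `read_row` helpers, and the sample values are all hypothetical and use only the Python standard library.

```python
import struct
import tempfile
from pathlib import Path

# Hypothetical column layout: (name, struct format code).
# This is an illustration, not DescriptaStorus's real schema.
COLUMNS = [("MolWt", "d"), ("NumRings", "i"), ("LogP", "f")]

# "<" selects little-endian with no padding, so every row has the same size.
ROW = struct.Struct("<" + "".join(code for _, code in COLUMNS))


def write_store(path, rows):
    """Write rows back to back; each row occupies exactly ROW.size bytes."""
    with open(path, "wb") as fh:
        for row in rows:
            fh.write(ROW.pack(*row))


def read_row(path, index):
    """Seek directly to row `index` and decode it into a name -> value dict."""
    if index < 0:
        raise IndexError(f"row {index} out of range")
    with open(path, "rb") as fh:
        fh.seek(index * ROW.size)  # byte offset is computed, not searched for
        data = fh.read(ROW.size)
    if len(data) != ROW.size:
        raise IndexError(f"row {index} out of range")
    names = [name for name, _ in COLUMNS]
    return dict(zip(names, ROW.unpack(data)))


# Placeholder values for illustration only, not computed descriptors.
rows = [(78.11, 1, 1.5), (46.07, 0, -0.25), (180.16, 1, 1.25)]
store_path = Path(tempfile.mkdtemp()) / "descriptors.bin"
write_store(store_path, rows)

second = read_row(store_path, 1)  # reads only the 16 bytes of row 1
```

Because the byte offset of a row is just `index * ROW.size`, lookup cost does not grow with the number of rows, which is the property a training loop that samples molecules in random order depends on.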
Requires manual installation via git clone and depends on kyotocabinet for full indexing features, which adds complexity compared to standard pip packages.
The README offers basic examples but lacks comprehensive tutorials or API references, making advanced usage or troubleshooting difficult for newcomers.
Initial descriptor store creation can be time-consuming and resource-intensive for large datasets, as hinted in the TODO list's note on faster append-only store creation.