A curated list of publicly available medical datasets for machine learning, covering imaging, EHRs, literature, and speech.
Medical Data for Machine Learning is a curated GitHub repository that aggregates publicly available medical datasets for AI and ML research. It provides organized links to datasets across domains like medical imaging, electronic health records, biomedical literature, and speech data, saving researchers time in data discovery.
It is aimed at machine learning researchers, data scientists, and healthcare AI developers who need diverse, real-world medical data for training and evaluating models.
It centralizes scattered medical data resources into a single, well-structured list, reducing the overhead of finding and vetting datasets. The repository is community-maintained and focuses on datasets that are actually usable for ML tasks.
This repository is a comprehensive, community-maintained collection of medical datasets suitable for machine learning research. It serves as a centralized resource to help researchers and developers discover and access diverse medical data without scouring multiple sources.
The project aims to democratize access to medical data for the machine learning community by aggregating and organizing disparate public resources, thereby accelerating research in healthcare AI.
Centralizes scattered medical resources across imaging (e.g., EchoNet-Dynamic, ADNI), EHRs (e.g., MIMIC-III), literature, and speech data, saving researchers time in discovery.
Includes direct references to datasets from Grand Challenges, Kaggle, and MICCAI events, facilitating benchmarking and participation in medical AI competitions.
Maintained as an open-source GitHub repository, allowing community contributions to keep the list updated with new and diverse datasets.
Specifically targets machine learning research by listing datasets with annotations and structures suitable for training models, such as annotated medical images and clinical notes.
Many datasets require separate registration (e.g., ADNI, ABIDE) or are hosted on external sites, which can delay access and makes for a fragmented user experience.
Datasets vary widely in format, licensing, and degree of preprocessing, so integrating them into ML pipelines takes significant manual effort.
As a curated list, it doesn't validate dataset quality, completeness, or privacy compliance, leaving researchers to independently assess each source.
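Because the list leaves registration, format, licensing, and quality checks to the reader, one way to keep track of them is a small personal manifest. The sketch below is only an illustration under stated assumptions: the DatasetEntry schema, its field names, and the "example-speech-set" entry are hypothetical and not part of the repository. The registration flags for ADNI and ABIDE follow the note above; their format and license are deliberately left as "unknown" until checked against each source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DatasetEntry:
    """One row of a personal tracking manifest (hypothetical schema)."""
    name: str
    domain: str                   # e.g. "imaging", "ehr", "literature", "speech"
    data_format: str = "unknown"  # fill in after inspecting the source
    license: str = "unknown"      # fill in after reading the terms of use
    requires_registration: bool = False
    vetted: bool = False          # set once quality and privacy are checked


def pending_review(entries: List[DatasetEntry]) -> List[str]:
    """Names of datasets whose format, license, or vetting is still open."""
    return [
        e.name
        for e in entries
        if not e.vetted or "unknown" in (e.data_format, e.license)
    ]


def group_by_domain(entries: List[DatasetEntry]) -> Dict[str, List[str]]:
    """Map each domain to the dataset names recorded under it."""
    groups: Dict[str, List[str]] = {}
    for e in entries:
        groups.setdefault(e.domain, []).append(e.name)
    return groups


manifest = [
    DatasetEntry("ADNI", "imaging", requires_registration=True),
    DatasetEntry("ABIDE", "imaging", requires_registration=True),
    # Hypothetical placeholder entry, not a real dataset.
    DatasetEntry("example-speech-set", "speech", data_format="wav",
                 license="CC-BY-4.0", vetted=True),
]

print(pending_review(manifest))
print(group_by_domain(manifest))
```

A flat record per dataset keeps the assessment explicit: anything still marked "unknown" or unvetted shows up in pending_review before it reaches a training pipeline.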