A collection of large-scale datasets for source code analysis and machine learning on code, including GitHub repositories, identifiers, and commit data.
source{d} Datasets is a collection of large-scale, curated datasets for source code analysis and machine learning on code. It provides preprocessed data from GitHub, Docker Hub, and other sources, including repositories, identifiers, commit messages, and PR reviews, so researchers can focus on model development rather than data collection.
Researchers and data scientists working on machine learning for software engineering, source code analysis, or AI-assisted programming tools who need large, clean datasets for training and evaluation.
Its datasets are already cleaned, structured, and ready for analysis, which saves the time and effort of collecting and preprocessing raw data from disparate sources.
source{d} datasets ("big code") for source code analysis and machine learning on source code
The Public Git Archive includes 260k+ repositories and ~28 billion lines of code, according to the README, which is enough data to train ML models at scale.
Covers identifiers, commit messages, PR reviews, and Docker metadata, enabling comprehensive software engineering research across multiple domains, detailed in the dataset list.
Datasets are cleaned and structured, such as the 49M distinct identifiers extracted from 10+ languages, saving significant time on raw data collection and preprocessing.
Includes scripts to regenerate datasets, ensuring transparency and allowing customization, as mentioned in the repository's tools and scripts section.
Most datasets are static snapshots ending in or before 2019, such as commit messages through March 2019 and PR comments from 2015 to 2018, which limits their relevance for recent software trends.
Datasets like the 6TB Public Git Archive require substantial infrastructure, which can be prohibitive for individuals or teams with limited resources.
Reproducing the datasets means running the provided scripts with little step-by-step guidance, which may require advanced setup and dependency management.