A curated collection of data sets and tools for empirical software engineering and mining software repositories research.
Awesome Empirical Software Engineering is a curated repository of data sets and tools for conducting evidence-based, data-driven research on software systems. It provides resources for mining software repositories, analyzing code quality, and studying software evolution, supporting the field of empirical software engineering.
It is aimed at academic researchers, PhD students, and data scientists working on software engineering research, particularly those studying software evolution, code quality, repository mining, and empirical methods.
It offers a centralized, community-maintained collection of high-quality data sets and specialized tools, saving researchers time in data collection and enabling more reproducible studies compared to gathering resources individually.
A curated repository of software engineering repository mining data sets
The README lists over 20 data sets, including GHTorrent, Defects4J, and Unix history, that provide diverse, real-world software engineering data for research on commits, bugs, and code evolution.
Includes tools such as PyDriller for analyzing Git repositories and RefactoringMiner for detecting refactorings in commit history, ready-to-use frameworks listed in the Tools section that simplify repository mining tasks.
Links to key research outlets such as the MSR conference and the Empirical Software Engineering journal, giving researchers direct pointers to the field's main venues.
Actively encourages contributions via a guide and email support, as noted in the README, ensuring the list evolves with new resources and stays current through crowd-sourced updates.
It's merely a curated list; users must independently set up, configure, and maintain the tools and datasets, which can involve complex dependencies and learning curves not addressed here.
The README admits the list requires continuous improvement and contributions, so some resources may be outdated or lack recent updates, relying on community vigilance for accuracy.
Focuses heavily on open-source and academic resources, making it less suitable for industry teams that need proprietary data sets or commercially supported tools.