A practical guide for researchers on how to properly structure and share data with statisticians to ensure efficient analysis.
Data Sharing is a guide created by the Leek group to help researchers and collaborators share data effectively with statisticians. It outlines a structured approach to delivering raw data, tidy datasets, code books, and processing instructions to avoid common pitfalls and speed up analysis. The guide focuses on reproducibility, clear documentation, and efficient collaboration between data collectors and analysts.
The guide is aimed at researchers, students, postdocs, and collaborators across disciplines who need to share data with statisticians or data scientists for analysis, especially those unfamiliar with best practices in data preparation.
It provides a concrete, step-by-step framework that reduces delays in data analysis by emphasizing tidy data principles, reproducibility, and clear communication, which are often overlooked in ad-hoc data sharing.
The Leek group guide to data sharing
Promotes Hadley Wickham's tidy data principles, ensuring each variable is a column and each observation a row, which simplifies analysis and reduces errors.
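The tidy-data idea can be shown with a small reshape. This sketch uses pandas with invented column names and values; it is an illustration of the principle, not data or code from the guide itself.

```python
import pandas as pd

# Untidy "wide" layout: one row per subject, one column per
# treatment arm (names here are invented for illustration).
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "treatment_a": [4.1, 5.0],
    "treatment_b": [6.2, 7.3],
})

# Tidy layout: each variable is a column (subject, treatment,
# value) and each observation is a row.
tidy = wide.melt(id_vars="subject", var_name="treatment", value_name="value")
print(tidy)
```

In the tidy form, a statistician can group, filter, or model on `treatment` directly instead of writing code that knows about specific column names.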
Recommends providing explicit scripts or pseudocode for data processing, enabling others to replicate analyses from raw to tidy data, as highlighted in the reproducibility section.
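A raw-to-tidy script of the kind the guide recommends might look like the following minimal sketch. The column names, unit conversion, and cleaning steps are assumptions chosen for illustration; the point is that every transformation is written down so a collaborator can rerun it from the raw data.

```python
import pandas as pd

def raw_to_tidy(raw: pd.DataFrame) -> pd.DataFrame:
    """Deterministically turn a raw table into a tidy one.

    Each step is explicit so the tidy dataset can be
    regenerated from the raw data at any time.
    """
    tidy = raw.rename(columns=str.lower)           # normalize header case
    tidy = tidy.dropna(subset=["subject_id"])      # drop rows missing an ID (assumed column)
    # Record unit conversions in code, not by hand-editing cells.
    tidy["weight_kg"] = tidy["weight_lb"] * 0.45359237
    return tidy.sort_values("subject_id").reset_index(drop=True)

# Example run on a tiny in-memory "raw" table (invented data).
raw = pd.DataFrame({
    "Subject_ID": ["s2", "s1", None],
    "Weight_LB": [150.0, 180.0, 170.0],
})
tidy = raw_to_tidy(raw)
print(tidy)
```

Shipping the script alongside the raw and tidy files, as the guide suggests, means the tidy dataset never becomes an unexplained artifact.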
Includes detailed requirements for code books covering variables, units, and study design, which minimizes misunderstandings between collaborators.
Acknowledges that statisticians should still receive the raw data, while advocating pre-processing into a tidy dataset to speed up collaboration, based on the Leek group's real-world experience.
Focuses on principles without providing concrete examples beyond R and Excel, leaving users unsure how to implement steps with modern data tools.
The guide requires extensive manual documentation and tidying, which can be burdensome for large datasets or fast-paced projects needing quick insights.
Does not address how to handle big data or automate sharing processes, limiting its usefulness for teams with advanced data engineering needs.