Question 1

How do I download source{d} datasets?

Accepted Answer

Follow the dataset links in the GitHub repository. Some datasets are very large: the Public Git Archive, for example, is 6TB and may require a direct download or the use of download scripts. Check each dataset's page for specific instructions, and make sure you have enough storage before you start.
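As a rough illustration of the storage advice, here is a minimal Python sketch that checks free disk space before starting a large download. The function name `has_enough_space`, the 10% headroom, and the target directory are assumptions made for this example and are not part of any source{d} tooling. The 6TB figure is the one quoted above, read here as decimal terabytes.

```python
import shutil

TB = 10**12  # decimal terabyte; an assumption about how the 6TB figure is measured


def has_enough_space(path, required_bytes, headroom=0.10):
    """Return True if the filesystem holding `path` can fit the download.

    `headroom` reserves an extra fraction of space (10% by default) for
    temporary files; the value is illustrative, not a recommendation
    from the dataset maintainers.
    """
    free = shutil.disk_usage(path).free
    return free >= required_bytes * (1 + headroom)


if __name__ == "__main__":
    target = "."  # hypothetical download directory
    if has_enough_space(target, 6 * TB):
        print("Enough free space for the Public Git Archive")
    else:
        print("Not enough free space; pick another disk or a smaller dataset")
```

Running the check first avoids discovering a full disk hours into a multi-terabyte transfer.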

Question 2

source{d} datasets vs CodeSearchNet: which is better for ML on code?

Accepted Answer

Neither is better in general; it depends on your task. source{d} datasets offer broader coverage, including PR comments and Docker data, while CodeSearchNet focuses on code snippets, specifically functions paired with their documentation. For commit analysis or multi-domain research, source{d} is the better fit. For pure code search, CodeSearchNet may be enough.

Question 3

Are source{d} datasets updated with new GitHub data?

Accepted Answer

No, the datasets are static snapshots up to 2019. For newer data, you'll need to collect it independently or look for alternative sources, as the project doesn't provide regular updates.

Question 4

How to use these datasets for training a code generation model?

Accepted Answer

Start with the Programming Language Identifiers dataset for token-level training, and use the Public Git Archive for broader code context. Preprocess the data with the provided scripts, then feed it into a framework such as TensorFlow. The large volume of code in the Public Git Archive is its main advantage here, since code generation models generally benefit from more training examples.
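One common preprocessing step for identifier data is splitting names into lowercase subtokens before building a vocabulary. The sketch below is an assumption about what such a step could look like. It is not one of the provided scripts, and it makes no claim about the dataset's actual file format. The function name `split_identifier` is hypothetical.

```python
import re

# Matches, in order: an acronym followed by a capitalized word (HTTP in
# HTTPResponse), a capitalized or lowercase word, a trailing acronym, or digits.
_SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase subtokens."""
    return [part.lower() for part in _SUBTOKEN.findall(name)]


if __name__ == "__main__":
    for ident in ["parseHTTPResponse", "max_line_length2", "XMLParser"]:
        print(ident, "->", split_identifier(ident))
```

Subtoken splitting keeps the vocabulary small, because rare identifiers such as `parseHTTPResponse` decompose into common pieces. Whether it helps for a given code generation model is something to verify experimentally.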

Question 5

What programming languages are in the identifiers dataset?

Accepted Answer

The dataset contains identifiers extracted from 10+ languages, but the README does not list which ones. For a detailed breakdown of languages and coverage, refer to the dataset documentation or contact the maintainers.

Question 6

Can I contribute or extend these datasets?

Accepted Answer

Yes, contributions are welcome per CONTRIBUTING.md. You can propose new datasets or improvements, but note the focus is on reproducible, large-scale data for ML on code, so ensure alignment with the project's goals.
