Scripts and tools to recreate the ELI5 dataset for long-form question answering research.
ELI5 is a research project that provides scripts and tools to recreate a dataset for long-form question answering (LFQA). It constructs a corpus by pairing explanatory questions and answers from the ELI5 subreddit with supporting documents from CommonCrawl, enabling the training of models that generate detailed, paragraph-length answers. The project addresses the need for high-quality, open datasets in the LFQA research community.
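The pairing described above can be pictured as a small record type plus a helper that turns one record into a source/target text pair for a sequence-to-sequence model. This is a minimal sketch under stated assumptions: the `LFQAExample` class, the `to_seq2seq_pair` helper, the `[SUPPORT]` separator, and the sample text are all hypothetical illustrations, not the project's actual data format or scripts.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LFQAExample:
    """Hypothetical record: one question, its answers, and supporting passages."""
    question: str
    answers: List[str]
    support_passages: List[str] = field(default_factory=list)


def to_seq2seq_pair(example: LFQAExample, max_support_words: int = 50) -> Tuple[str, str]:
    """Build a (source, target) pair from one example.

    The source is the question followed by the supporting text, truncated to
    max_support_words words. The target is the first answer. The "[SUPPORT]"
    separator is an assumption made for this sketch.
    """
    support_words = " ".join(example.support_passages).split()[:max_support_words]
    source = example.question + " [SUPPORT] " + " ".join(support_words)
    target = example.answers[0]
    return source, target


# Illustrative data only; not taken from the real dataset.
example = LFQAExample(
    question="Why is the sky blue?",
    answers=["Air molecules scatter blue light more than red light."],
    support_passages=[
        "Rayleigh scattering affects short wavelengths most.",
        "Blue light has a short wavelength.",
    ],
)
source, target = to_seq2seq_pair(example, max_support_words=5)
```

Truncating the supporting text by word count keeps the sketch simple; a real pipeline would more likely select passages by relevance to the question before truncating.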
ELI5 is aimed at researchers and practitioners in natural language processing, particularly those working on long-form question answering, dataset creation, or generative language models. It is also relevant to academics and engineers who need reproducible pipelines for large-scale text data processing.
ELI5 offers a transparent, script-based approach to dataset creation, allowing full control and customization over the data pipeline. Unlike static datasets, it provides tools to regenerate the dataset with updated sources or heuristics, and includes pretrained models and evaluation scripts to jumpstart LFQA experiments.
Scripts and links to recreate the ELI5 dataset.
Provides full scripts for downloading and processing Reddit and CommonCrawl data, ensuring transparency and allowing custom modifications to the dataset creation process.
Includes pretrained models, BPE encoding, and Fairseq-py training/evaluation scripts, reducing the barrier to entry for long-form QA experiments.
Part of established benchmarks like KILT and Dodecadialogue, facilitating standardized evaluation and comparison with other NLP models.
Allows users to tweak heuristics or update sources, unlike static datasets, offering control over data quality and relevance.
Requires a SLURM cluster, 100+ GB storage, and 48+ hours of compute, making it inaccessible for individual researchers or small teams without such resources.
The multi-step process can fail partway through and may require manually recovering interrupted threads, as noted in the FAQ about SLURM instability.
Reddit data is capped at 2018, so it doesn't reflect current trends or information, limiting its use for contemporary applications.