How do I run ELI5 on a single machine without a cluster?

The README states that recreating the full dataset on a single machine is impractical because of the CommonCrawl processing step. For lighter experimentation, you can instead access the pre-processed data through the Hugging Face nlp library (since renamed datasets).

ELI5 vs SQuAD: which is better for long-form question answering?

ELI5 focuses on generative, paragraph-length answers backed by supporting documents from CommonCrawl, while SQuAD targets extractive QA over Wikipedia passages, where each answer is a span of the passage. Use ELI5 for tasks that require detailed explanations and document retrieval.
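To make the extractive versus generative distinction concrete, here is a minimal sketch of the two record shapes. The SQuAD-style fields (a context plus an answer text and its character offset) mirror that dataset's published format; the ELI5-style record and its field names are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ExtractiveExample:
    """SQuAD-style record: the answer is a character span inside the context."""
    question: str
    context: str
    answer_start: int
    answer_text: str

    def span_is_valid(self) -> bool:
        # The answer must be recoverable by slicing the context.
        end = self.answer_start + len(self.answer_text)
        return self.context[self.answer_start:end] == self.answer_text


@dataclass
class GenerativeExample:
    """ELI5-style record (hypothetical field names): free-text answer plus support."""
    question: str
    support_document: str
    answer: str  # paragraph-length; need not appear verbatim in the support text


context = "The Eiffel Tower is in Paris."
squad_like = ExtractiveExample(
    question="Where is the Eiffel Tower?",
    context=context,
    answer_start=context.index("Paris"),
    answer_text="Paris",
)

eli5_like = GenerativeExample(
    question="Why is the sky blue?",
    support_document="Sunlight is scattered by air molecules; shorter wavelengths scatter more.",
    answer=(
        "Sunlight contains many colors. Air scatters blue light more than red, "
        "so blue light reaches your eyes from every direction."
    ),
)
```

An extractive model only has to point at a span, so it can be scored by exact match against that span; a generative model has to write the answer, which is why the two datasets suit different tasks.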

Can I fine-tune the ELI5 pretrained model on custom data?

Yes, the project provides scripts for data formatting and BPE application with Fairseq-py, but you'll need to adapt your data to the same structure and ensure compatibility with the multi-task labeling.

What are the hardware requirements for rebuilding the ELI5 dataset?

You need a SLURM cluster with around 100 threads, over 100GB of disk space, and at least 48 hours of compute time, as detailed in the CommonCrawl download and processing steps.
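Before committing to a multi-day run, a quick preflight check along these lines can catch an undersized machine early. This is a minimal sketch: the thresholds come from the figures quoted above, the function names are made up for illustration, and the local CPU count is only a rough stand-in, since on a SLURM cluster the available threads depend on the job allocation rather than on the login node.

```python
import os
import shutil

# Thresholds taken from the figures quoted above; adjust for your own setup.
REQUIRED_DISK_GB = 100
REQUIRED_THREADS = 100


def free_disk_gb(path: str = ".") -> float:
    """Free space, in decimal gigabytes, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def preflight(path: str = ".",
              required_disk_gb: float = REQUIRED_DISK_GB,
              required_threads: int = REQUIRED_THREADS) -> dict:
    """Compare local resources against the stated requirements (hypothetical helper)."""
    free = free_disk_gb(path)
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    return {
        "free_disk_gb": round(free, 1),
        "disk_ok": free >= required_disk_gb,
        "local_cpus": cpus,
        "threads_ok": cpus >= required_threads,
    }
```

Calling `preflight("/path/to/scratch")` returns a small report; a `False` in either `disk_ok` or `threads_ok` suggests the full rebuild is unlikely to fit on that machine.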

Is ELI5 compatible with Hugging Face Transformers?

Partially. The dataset is available through the Hugging Face nlp library, but the original modeling code uses Fairseq-py. To train with Hugging Face Transformers, load the data through the Hugging Face integration or convert the data yourself.

How do I handle SLURM job failures during data creation?

The README includes troubleshooting steps, such as checking finished threads and relaunching specific slices, but this requires manual intervention and can be time-consuming.
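The "check what finished, relaunch the rest" step can be sketched as below. It assumes, purely for illustration, that each completed slice leaves behind a marker file named like `slice_17.done`; the project's real output layout and relaunch commands are described in its README and may differ.

```python
from __future__ import annotations

from pathlib import Path


def finished_slices(marker_dir: Path) -> set[int]:
    """Collect slice ids from marker files named `slice_<id>.done` (hypothetical layout)."""
    done = set()
    for marker in marker_dir.glob("slice_*.done"):
        suffix = marker.stem.split("_", 1)[1]  # "slice_17" -> "17"
        if suffix.isdigit():
            done.add(int(suffix))
    return done


def slices_to_relaunch(total_slices: int, done: set[int]) -> list[int]:
    """Return the ids in [0, total_slices) that have no completion marker."""
    return [i for i in range(total_slices) if i not in done]
```

The returned list can then drive whatever resubmission mechanism the cluster uses, so only the missing slices are recomputed rather than the whole job.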

Open-Awesome

ELI5

NOASSERTION · Python

Scripts and tools to recreate the ELI5 dataset for long-form question answering research.

GitHub

324 stars · 42 forks · 0 contributors

What is ELI5?

ELI5 is a research project that provides scripts and tools to recreate a dataset for long-form question answering (LFQA). It constructs a corpus by pairing explanatory questions and answers from the ELI5 subreddit with supporting documents from CommonCrawl, enabling the training of models that generate detailed, paragraph-length answers. The project addresses the need for high-quality, open datasets in the LFQA research community.

Target Audience

Researchers and practitioners in natural language processing, specifically those working on long-form question answering, dataset creation, or generative language models. It is also relevant for academics and engineers needing reproducible pipelines for large-scale text data processing.

Value Proposition

ELI5 offers a transparent, script-based approach to dataset creation, allowing full control and customization over the data pipeline. Unlike static datasets, it provides tools to regenerate the dataset with updated sources or heuristics, and includes pretrained models and evaluation scripts to jumpstart LFQA experiments.

Overview

Scripts and links to recreate the ELI5 dataset.

Use Cases

Best For

Researching long-form question answering models
Building custom datasets from Reddit and CommonCrawl
Training transformer-based generative models
Reproducing academic experiments in NLP
Benchmarking multi-task learning approaches
Developing explainable AI systems for text generation

Not Ideal For

Researchers without access to SLURM clusters or high-performance computing resources
Projects requiring up-to-date or real-time data beyond the 2011-2018 Reddit scope
Teams seeking a quick, plug-and-play dataset without multi-day processing and setup

Pros & Cons

Pros

Reproducible Data Pipeline

Provides full scripts for downloading and processing Reddit and CommonCrawl data, ensuring transparency and allowing custom modifications to the dataset creation process.

Comprehensive Modeling Support

Includes pretrained models, BPE encoding, and Fairseq-py training/evaluation scripts, reducing the barrier to entry for long-form QA experiments.

Benchmark Integration

Part of established benchmarks like KILT and Dodecadialogue, facilitating standardized evaluation and comparison with other NLP models.

Flexible Data Customization

Allows users to tweak heuristics or update sources, unlike static datasets, offering control over data quality and relevance.

Cons

Heavy Infrastructure Dependency

Requires a SLURM cluster, 100+ GB storage, and 48+ hours of compute, making it inaccessible for individual researchers or small teams without such resources.

Complex and Error-Prone Setup

The multi-step process involves manual recovery for interrupted threads and potential failures, as noted in the FAQ about SLURM instability.

Outdated Data Limitations

Reddit data is capped at 2018, so it doesn't reflect current trends or information, limiting its use for contemporary applications.

Frequently Asked Questions

Related Projects

DeepMind QA Corpus

Question answering dataset featured in "Teaching Machines to Read and Comprehend"

Stars: 1,296 · Forks: 239 · Last commit: 9 years ago

NarrativeQA

This repository contains the NarrativeQA dataset. It includes the list of documents with Wikipedia summaries, links to full stories, and questions and answers.

Stars: 518 · Forks: 71 · Last commit: 6 years ago

NewsQA

Tools for using Maluuba's NewsQA Dataset (public version)

Stars: 257 · Forks: 56 · Last commit: 3 years ago

GraphQuestions

A characteristic-rich dataset for factoid question answering described in the paper "On Generating Characteristic-rich Question Sets for QA Evaluation" - EMNLP'16

Stars: 94 · Forks: 14 · Last commit: 3 years ago

Community-curated · Updated weekly · 100% open source

