A reading comprehension dataset with Wikipedia summaries, full stories, and question-answer pairs for narrative understanding.
NarrativeQA is a reading comprehension dataset created by DeepMind for evaluating machine understanding of entire narratives. It consists of documents with Wikipedia summaries, links to full stories (books and movie scripts), and corresponding question-answer pairs. The dataset challenges models to comprehend long-form text rather than just extract factual information from short passages.
NarrativeQA is aimed at researchers and developers working on natural language processing, particularly reading comprehension, question answering, and narrative understanding. It is especially relevant to those building or evaluating models that must understand long documents and complex storylines.
Unlike many QA datasets that focus on factoid extraction from short passages, NarrativeQA requires understanding of entire narratives, which makes it a harder and more realistic task. It provides a benchmark for testing reading comprehension that goes beyond locating facts in a short passage.
This repository contains the NarrativeQA dataset. It includes the list of documents with Wikipedia summaries, links to full stories, and questions and answers.
Designed to test understanding of entire narratives rather than short passages (as in datasets like SQuAD), which makes it well suited to benchmarking reading comprehension on complex stories.
Includes Wikipedia summaries and detailed metadata such as word counts and source information in documents.csv, providing additional context for model training and evaluation.
Offers tokenized versions of questions and answers in qaps.csv, easing integration into NLP pipelines and reducing preprocessing overhead for researchers.
Created by DeepMind and described in a peer-reviewed paper; the repository ships structured data files and citation guidelines.
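As a rough illustration of how documents.csv and qaps.csv might be combined, the sketch below groups question-answer rows under their document's metadata row. The column names (document_id, story_word_count, question_tokenized, and so on) are assumptions about the file headers and should be checked against the actual files; the two sample strings are invented placeholder rows, not real dataset entries.

```python
import csv
import io
from collections import defaultdict


def read_rows(handle):
    """Parse a CSV file object into a list of dicts keyed by header name."""
    return list(csv.DictReader(handle))


def group_questions(doc_rows, qa_rows):
    """Attach each document's QA rows to its metadata row via document_id."""
    by_doc = defaultdict(list)
    for qa in qa_rows:
        by_doc[qa["document_id"]].append(qa)
    return {
        doc["document_id"]: {"meta": doc, "qas": by_doc[doc["document_id"]]}
        for doc in doc_rows
    }


# Invented stand-ins for documents.csv and qaps.csv (assumed headers).
DOCS_SAMPLE = (
    "document_id,set,kind,story_word_count\n"
    "d1,train,gutenberg,1200\n"
    "d2,test,movie,800\n"
)
QAPS_SAMPLE = (
    "document_id,set,question,answer1,answer2,"
    "question_tokenized,answer1_tokenized,answer2_tokenized\n"
    "d1,train,Who leaves?,The captain.,Captain,"
    "Who leaves ?,The captain .,Captain\n"
)

grouped = group_questions(
    read_rows(io.StringIO(DOCS_SAMPLE)),
    read_rows(io.StringIO(QAPS_SAMPLE)),
)
first = grouped["d1"]["qas"][0]
# The tokenized columns are assumed to be whitespace-separated tokens.
print(first["question_tokenized"].split())
```

With the real files, the io.StringIO objects would be replaced by open("documents.csv", newline="", encoding="utf-8") and the equivalent call for qaps.csv.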
Full stories are not included in the repository; users must download them separately with download_stories.sh, a step that can be time-consuming, is prone to link rot, and requires manual verification with compare.sh.
The dataset primarily provides summaries and QA pairs, with stories hosted externally, making it less self-contained and adding steps for full utilization, unlike datasets that bundle all text.
Focuses on a fixed set of books and movies, which may not generalize to other domains like technical or real-time narratives, and lacks updates since its release.
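Because the stories are fetched separately, a quick local check for missing or truncated downloads can help. The sketch below is an independent illustration and does not reproduce what compare.sh does; the story_file_size column, the ".content" file suffix, and the find_problem_stories helper are all assumptions made for the example.

```python
import os
import tempfile


def find_problem_stories(doc_rows, story_dir, suffix=".content"):
    """Return {document_id: reason} for stories that are missing or mis-sized.

    Assumes each story is saved as <document_id><suffix> in story_dir and that
    each metadata row carries the expected size in bytes as story_file_size.
    """
    problems = {}
    for row in doc_rows:
        path = os.path.join(story_dir, row["document_id"] + suffix)
        if not os.path.isfile(path):
            problems[row["document_id"]] = "missing"
        elif os.path.getsize(path) != int(row["story_file_size"]):
            problems[row["document_id"]] = "size mismatch"
    return problems


# Demo with a temporary directory and invented metadata rows.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "d1.content"), "wb") as fh:
        fh.write(b"x" * 10)
    rows = [
        {"document_id": "d1", "story_file_size": "10"},
        {"document_id": "d2", "story_file_size": "5"},
    ]
    report = find_problem_stories(rows, tmp)

print(report)
```

A size comparison only catches missing or truncated files; it cannot detect a source page whose content changed while keeping the same length.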
Question answering dataset featured in "Teaching Machines to Read and Comprehend".
Scripts and links to recreate the ELI5 dataset.
Tools for using Maluuba's NewsQA Dataset (public version)