Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.


DeepMind QA Corpus

Apache-2.0 · Python

Script to generate question/answer pairs from CNN and Daily Mail articles for machine reading comprehension research.

1.3k stars · 240 forks · 0 contributors

What is DeepMind QA Corpus?

DeepMind QA Corpus, published as RC-Data, is a script that generates question/answer pairs from CNN and Daily Mail news articles for machine reading comprehension research. The resulting benchmark requires models to answer questions using only the article context, which supports the development of systems that can understand and reason about text. The project was introduced in the DeepMind paper "Teaching Machines to Read and Comprehend".

Target Audience

Machine learning researchers and NLP practitioners working on question answering, reading comprehension, and text understanding tasks who need standardized datasets for model training and evaluation.

Value Proposition

Researchers choose RC-Data because it provides a large-scale, realistic benchmark for reading comprehension with carefully constructed question/answer pairs from real news articles. The dataset's entity anonymization and dual corpus support make it valuable for training robust models that generalize across different text sources.

Overview

Question answering dataset featured in "Teaching Machines to Read and Comprehend".

Use Cases

Best For

  • Training machine reading comprehension models on news article text
  • Benchmarking question answering systems against standardized datasets
  • Research on entity-based reasoning in natural language processing
  • Developing models that understand context to answer questions
  • Creating educational resources for NLP dataset generation techniques
  • Studying the performance of deep learning models on text comprehension tasks

Not Ideal For

  • Projects requiring datasets with news articles from after 2015 for contemporary language modeling
  • Teams with limited storage capacity or without SSD drives to handle millions of small files
  • Developers using modern Python 3 toolchains who cannot maintain legacy Python 2.7 environments

Pros & Cons

Pros

Large-Scale Benchmark

Generates over a million question/answer pairs from CNN and Daily Mail articles, providing a substantial dataset for training robust machine reading comprehension models, as indicated by the script's output scale.

Entity Anonymization

Uses entity anonymization techniques to create cloze-style questions, forcing models to rely on contextual understanding rather than named entities, as detailed in the linked research paper.
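
The idea behind entity anonymization can be illustrated with a minimal Python 3 sketch. It is not the RC-Data script itself: the function name `anonymize_example` is hypothetical, the entity list is supplied by hand (where a real pipeline would rely on entity detection and coreference resolution), and the `@entityN` and `@placeholder` markers are assumed here as the marker style.

```python
import random
import re


def anonymize_example(article, highlight, entities, answer, seed=0):
    """Return (context, question, answer_marker) for one cloze-style example.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    ids = list(range(len(entities)))
    # Shuffle ids per example so a marker carries no meaning across examples.
    rng.shuffle(ids)
    markers = {name: "@entity%d" % i for name, i in zip(entities, ids)}

    def replace_entities(text):
        # Replace longer names first so a long name wins over a shorter
        # name contained in it.
        for name in sorted(markers, key=len, reverse=True):
            text = re.sub(r"\b%s\b" % re.escape(name), markers[name], text)
        return text

    context = replace_entities(article)
    answer_marker = markers[answer]
    # Blank out the answer entity in the summary sentence to form the question.
    # The lookahead avoids touching "@entity10" when the answer is "@entity1".
    question = re.sub(
        re.escape(answer_marker) + r"(?!\d)",
        "@placeholder",
        replace_entities(highlight),
    )
    return context, question, answer_marker


context, question, answer_marker = anonymize_example(
    article="Alice Smith met Bob in Paris. Bob left Paris on Monday.",
    highlight="Bob left Paris",
    entities=["Alice Smith", "Bob", "Paris"],
    answer="Bob",
)
```

Because the marker assignment is reshuffled for each example, a model cannot memorize facts about a specific name and has to find the answer marker from the context alone.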

Dual Corpus Diversity

Supports both CNN and Daily Mail articles, offering varied text sources and writing styles, which enhances dataset generalization for QA systems.

Preprocessed Fallback

Provides an alternative download link for processed datasets when the script fails, ensuring accessibility even if the Wayback Machine is down, as mentioned in the README.

Cons

Deprecated Dependencies

Requires Python 2.7 and specific library versions like libxml2 2.9.1, which are outdated and may conflict with modern development environments, complicating setup.

Fragile Data Sourcing

Relies on the Wayback Machine for article downloads, which can be unreliable with missing URLs, necessitating fallback to preprocessed data and adding uncertainty to the generation process.
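
One way to decide when to fall back to the preprocessed data is to compare the expected URL list against what was actually downloaded. The Python 3 sketch below is an illustration under stated assumptions: it assumes each downloaded article is stored under the SHA-1 hash of its URL with an `.html` suffix, and the helper names and the 1% threshold are hypothetical rather than part of RC-Data.

```python
import hashlib
import os


def local_name(url, suffix=".html"):
    # Assumption: files are named by the SHA-1 hex digest of the article URL.
    return hashlib.sha1(url.encode("utf-8")).hexdigest() + suffix


def find_missing(urls, download_dir, suffix=".html"):
    """Return the URLs that have no corresponding file in download_dir."""
    present = set(os.listdir(download_dir))
    return [u for u in urls if local_name(u, suffix) not in present]


def should_fall_back(urls, download_dir, max_missing_ratio=0.01):
    """Suggest the preprocessed download when too many articles are missing."""
    missing = find_missing(urls, download_dir)
    return len(missing) > max_missing_ratio * len(urls)
```

Listing the directory once and checking membership in a set keeps the check cheap even when the directory holds a very large number of files.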

Storage Intensive Output

Creates approximately 1,000,000 small files for the Daily Mail corpus, so an SSD is recommended. This volume of files can strain storage and I/O performance in resource-constrained setups.

Quick Stats

Stars: 1,296
Forks: 240
Contributors: 0
Open Issues: 2
Last commit: 9 years ago
Created: 2015

Tags

#deep-learning #question-answering #natural-language-processing #text-processing #reading-comprehension #research #dataset #machine-learning

Built With

  • wget
  • libxml2
  • virtualenv
  • Python

Included in

  • Deep Learning (27.8k)
  • Question Answering (767)

Related Projects

Fashion-MNIST

An MNIST-like fashion product database.

Stars: 12,723
Forks: 3,075
Last commit: 3 years ago

Open Images dataset

The Open Images dataset

Stars: 4,365
Forks: 606
Last commit: 4 years ago

LLVIP

LLVIP: A Visible-infrared Paired Dataset for Low-light Vision

Stars: 819
Forks: 74
Last commit: 8 months ago