Script to generate question/answer pairs from CNN and Daily Mail articles for machine reading comprehension research.
RC-Data is a script that generates question/answer pairs from CNN and Daily Mail news articles for machine reading comprehension research. The resulting dataset serves as a benchmark in which models must answer questions using the article as context, supporting the development of systems that can understand and reason about text. The project was introduced in the 'Teaching Machines to Read and Comprehend' paper from DeepMind.
Machine learning researchers and NLP practitioners working on question answering, reading comprehension, and text understanding tasks who need standardized datasets for model training and evaluation.
Researchers choose RC-Data because it provides a large-scale, realistic benchmark for reading comprehension with carefully constructed question/answer pairs from real news articles. The dataset's entity anonymization and dual corpus support make it valuable for training robust models that generalize across different text sources.
Question answering dataset featured in "Teaching Machines to Read and Comprehend"
Generates over a million question/answer pairs from CNN and Daily Mail articles, providing a substantial dataset for training machine reading comprehension models.
Uses entity anonymization techniques to create cloze-style questions, forcing models to rely on contextual understanding rather than named entities, as detailed in the linked research paper.
Supports both CNN and Daily Mail articles, offering varied text sources and writing styles, which enhances dataset generalization for QA systems.
Provides an alternative download link for processed datasets when the script fails, ensuring accessibility even if the Wayback Machine is down, as mentioned in the README.
Requires Python 2.7 and specific library versions like libxml2 2.9.1, which are outdated and may conflict with modern development environments, complicating setup.
Relies on the Wayback Machine for article downloads, which can be unreliable with missing URLs, necessitating fallback to preprocessed data and adding uncertainty to the generation process.
Creates approximately 1,000,000 small files for the Daily Mail corpus, so an SSD is preferred; the file count can strain storage and I/O performance in resource-constrained setups.
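The entity anonymization and cloze-style question construction mentioned above can be illustrated with a minimal sketch. This is not the project's own code: the function names `anonymize` and `make_cloze`, the hand-supplied entity list, and the `@entityN` / `@placeholder` marker strings are assumptions made for illustration, and a real pipeline would obtain entities from an entity recognition and coreference step rather than a fixed list.

```python
import re


def anonymize(text, entities):
    """Replace each known entity string with an abstract marker.

    `entities` is a hypothetical, hand-supplied list of entity names.
    """
    if not entities:
        return text, {}
    mapping = {name: "@entity%d" % i for i, name in enumerate(entities)}
    # Try longer names first so a short name never splits a longer one.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in names))
    return pattern.sub(lambda m: mapping[m.group(0)], text), mapping


def make_cloze(article, summary, entities, answer):
    """Build one cloze-style example by hiding `answer` in the summary."""
    context, mapping = anonymize(article, entities)
    question, _ = anonymize(summary, entities)
    marker = mapping[answer]
    # Hide the first occurrence of the answer marker; the lookahead keeps
    # "@entity1" from matching the front of "@entity10".
    question = re.sub(re.escape(marker) + r"(?!\d)", "@placeholder",
                      question, count=1)
    return {"context": context, "question": question, "answer": marker}


example = make_cloze(
    article="Alice Smith met Bob Jones in Paris. Bob Jones praised the city.",
    summary="Bob Jones visited Paris",
    entities=["Alice Smith", "Bob Jones", "Paris"],
    answer="Paris",
)
print(example)
```

Because the names are replaced by markers, a model cannot answer from prior knowledge about "Paris" or "Bob Jones"; it has to locate the marker in the context that fits the blank in the question.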