Tools for compiling and using the Maluuba NewsQA dataset, a machine reading comprehension dataset based on CNN articles.
Maluuba NewsQA provides scripts that build the NewsQA dataset in JSON and CSV formats with annotated question-answer pairs, supporting research in natural language understanding. Because the underlying CNN articles cannot legally be redistributed, the project lets users reconstruct the dataset themselves from the source files.
It is intended for researchers and developers working on machine reading comprehension, question answering systems, and natural language processing who need annotated datasets for model training and evaluation.
It offers a well-structured, crowdsourced dataset with multiple answer annotations and validation metrics, specifically designed for span-based question answering tasks. The tooling ensures reproducibility despite legal constraints on direct data distribution.
Tools for using Maluuba's NewsQA Dataset (public version)
The dataset includes character-based and token-based answer spans, consensus metrics, and validation flags; the project's JSON example shows fields such as 'consensus' and 'validatedAnswers' that localize each answer precisely.
The tools output data in both JSON and CSV formats, so the dataset can be integrated into different processing pipelines, as described in the Data Description section.
The scripts generate the dataset from source files, keeping it accessible despite legal limits on distribution, a point the Philosophy section emphasizes for research integrity.
The dataset incorporates annotations from multiple crowdsourcers, with metrics such as isAnswerAbsent and isQuestionBad that give insight into data quality and question reliability.
Manual setup requires Python 2.7, as specified in the Requirements section; Python 2.7 reached end of life in January 2020 and may cause compatibility issues with modern libraries.
Setup requires downloading external files, configuring a Docker or Conda environment, and handling Java dependencies for tokenization, which adds significant overhead for users.
The dataset cannot be distributed directly because of CNN copyright, so users must compile it themselves, which adds extra steps and potential legal scrutiny.
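As a rough illustration of how the annotation fields mentioned above might be used, the following Python sketch filters out unreliable questions and extracts the consensus answer span from a story. The field names 'consensus', 'validatedAnswers', 'isAnswerAbsent', and 'isQuestionBad' come from the description above; the keys 'text', 'questions', 'q', 's', 'e', and 'count', the end-exclusive character offsets, the 0.5 thresholds, and the sample record itself are assumptions made for this example and should be checked against the JSON that the scripts actually generate.

```python
import json

# Hypothetical record. Only 'consensus', 'validatedAnswers', 'isAnswerAbsent',
# and 'isQuestionBad' are named in the description; every other key and all
# values here are assumed for illustration.
sample = json.loads("""
{
  "text": "The quick brown fox jumps over the lazy dog.",
  "questions": [
    {"q": "What jumps over the dog?",
     "consensus": {"s": 4, "e": 19},
     "isAnswerAbsent": 0.0,
     "isQuestionBad": 0.0,
     "validatedAnswers": [{"s": 4, "e": 19, "count": 2}]},
    {"q": "Who wrote the article?",
     "consensus": {"noAnswer": true},
     "isAnswerAbsent": 1.0,
     "isQuestionBad": 0.0}
  ]
}
""")


def answerable_spans(story, max_bad=0.5, max_absent=0.5):
    """Return (question, answer text) pairs for questions that pass the filters.

    The thresholds are arbitrary defaults chosen for this sketch.
    """
    results = []
    for question in story["questions"]:
        # Skip questions that crowdsourcers flagged as bad or unanswerable.
        if question.get("isQuestionBad", 0.0) > max_bad:
            continue
        if question.get("isAnswerAbsent", 0.0) > max_absent:
            continue
        consensus = question.get("consensus", {})
        # Skip entries with no character span (assumed keys 's' and 'e').
        if "s" not in consensus or "e" not in consensus:
            continue
        # Assumes 's' is inclusive and 'e' is exclusive, as in Python slicing.
        answer = story["text"][consensus["s"]:consensus["e"]]
        results.append((question["q"], answer))
    return results


spans = answerable_spans(sample)
for question_text, answer_text in spans:
    print(question_text, "->", answer_text)
```

Filtering on the quality metrics before reading the span keeps questions without a usable answer out of training data; the right thresholds depend on how strict an experiment needs to be.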