Question 1

How do I compile the NewsQA dataset from scratch?

Accepted Answer

Clone the repo and download the CNN stories and questions tar.gz files into the maluuba/newsqa folder. Then run data_generator.py using either Docker or a manual Python 2.7 setup with Conda, as detailed in the README. This generates JSON and CSV files with train/dev/test splits.
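
Before running the generator, it can help to confirm that the downloaded archives are actually in the folder. The following is a minimal sketch, not part of the NewsQA repo: the helper name missing_archives is hypothetical, and the archive file names in the usage line are placeholders, so substitute the exact names given in the README.

```python
from pathlib import Path


def missing_archives(folder, expected_names):
    """Return the expected archive names that are not present in folder.

    Hypothetical helper: the caller supplies the file names, since the
    exact archive names come from the README, not from this sketch.
    """
    folder = Path(folder)
    return [name for name in expected_names if not (folder / name).is_file()]


if __name__ == "__main__":
    # Placeholder names; replace them with the archive names from the README.
    missing = missing_archives("maluuba/newsqa", ["stories.tgz", "questions.tar.gz"])
    if missing:
        print("Missing archives:", ", ".join(missing))
    else:
        print("All archives found.")
```

An empty result means every listed archive was found, so the Docker or Conda step can proceed.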

Question 2

NewsQA vs SQuAD: which dataset is better for question answering research?

Accepted Answer

NewsQA is built on CNN news articles and includes detailed crowdsourced annotations and validation metrics, which makes it well suited to studying answer-span consistency. SQuAD is built on Wikipedia and is more widely used. Choose NewsQA for news-domain work, and SQuAD for broader topic diversity and larger community support.

Question 3

What are the legal terms for using NewsQA in my project?

Accepted Answer

CNN articles are used by permission, and CNN does not waive ownership rights or endorse the project, as stated in the Legal section. Check LICENSE.txt and ensure compliance, especially for commercial use, which may require direct permission from CNN.

Question 4

How do I tokenize the NewsQA data?

Accepted Answer

Install a JDK and place the Stanford JAR files (version 3.6.0) in the maluuba/newsqa folder, then run python maluuba/newsqa/data_generator.py. Warnings about some tokens are expected. Per the README, the output is written to newsqa-data-tokenized-*.csv and uses token-based indices.
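
To show how token-based indices might be consumed, here is a small sketch. It assumes that a range field looks like "3:5,9:10" (comma-separated start:end pairs), that the end index is exclusive, and that the tokenized story text is separated by single spaces. These are assumptions for illustration, so check them against a few rows of your generated CSV. The function names are hypothetical.

```python
def parse_token_ranges(field):
    """Parse a string such as "3:5,9:10" into [(3, 5), (9, 10)].

    Assumes comma-separated "start:end" pairs; empty parts are skipped.
    """
    ranges = []
    for part in field.split(","):
        part = part.strip()
        if not part:
            continue
        start, end = part.split(":")
        ranges.append((int(start), int(end)))
    return ranges


def answer_tokens(story_text, field):
    """Return the answer text for each range, treating the end index as exclusive."""
    tokens = story_text.split(" ")
    return [" ".join(tokens[start:end]) for start, end in parse_token_ranges(field)]


if __name__ == "__main__":
    print(answer_tokens("The quick brown fox jumps", "1:3"))
```

If the extracted spans look off by one token, the end index is probably inclusive in your files, and the slice should become tokens[start:end + 1].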

Question 5

What does the 'isAnswerAbsent' field mean in the JSON?

Accepted Answer

It is the proportion of crowdsourcers who said the story contains no answer to the question. A high value suggests the question is hard or unanswerable, so the field helps assess question difficulty and data quality. Together with 'isQuestionBad' (the proportion who said the question does not make sense), it can be used to screen questions before model training.
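
One way to use these two fields is to drop questions that too many crowdsourcers flagged. The sketch below assumes the loaded JSON has a top-level "data" list of stories, each with a "storyId" and a "questions" list whose items carry "q", "isAnswerAbsent", and "isQuestionBad". Verify that layout against your generated file. The function name and the 0.5 thresholds are arbitrary choices for illustration.

```python
def keep_reliable_questions(dataset, max_absent=0.5, max_bad=0.5):
    """Return (storyId, question) pairs whose flags are at or below the thresholds.

    Both fields are proportions in [0, 1]; a missing field is treated as 0.0.
    """
    kept = []
    for story in dataset["data"]:
        for question in story["questions"]:
            if question.get("isAnswerAbsent", 0.0) > max_absent:
                continue  # too many annotators found no answer in the story
            if question.get("isQuestionBad", 0.0) > max_bad:
                continue  # too many annotators said the question makes no sense
            kept.append((story["storyId"], question["q"]))
    return kept


if __name__ == "__main__":
    # Tiny made-up sample, not real NewsQA data.
    sample = {
        "data": [
            {
                "storyId": "story-1",
                "questions": [
                    {"q": "Who spoke?", "isAnswerAbsent": 0.0, "isQuestionBad": 0.0},
                    {"q": "What color?", "isAnswerAbsent": 1.0, "isQuestionBad": 0.0},
                ],
            }
        ]
    }
    print(keep_reliable_questions(sample))
```

Whether to filter at all depends on the task: models that must learn to abstain may want to keep the high isAnswerAbsent questions as negative examples.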

Question 6

Can I use NewsQA with Python 3 without Docker?

Accepted Answer

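The manual setup requires Python 2.7 because of encoding issues in the original stories. Once the data is compiled, however, the resulting JSON and CSV files can be loaded with any tool, including Python 3. If your local environment is Python 3, Docker is the recommended way to run the compilation step, since it avoids dependency conflicts.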
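
As an illustration of reading the compiled output from Python 3, the sketch below loads a CSV file with only the standard library. The function name load_csv_rows and the file path in the usage line are placeholders, and UTF-8 encoding is an assumption, so adjust both to match the files the generator produced for you.

```python
import csv


def load_csv_rows(path, encoding="utf-8"):
    """Read a CSV file into a list of dicts keyed by the header row.

    newline="" lets the csv module handle line breaks inside quoted
    fields, which matters for multi-line story text.
    """
    with open(path, encoding=encoding, newline="") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Placeholder path; point this at one of your generated CSV files.
    rows = load_csv_rows("path/to/generated.csv")
    print(len(rows), "rows loaded")
    if rows:
        print("Columns:", list(rows[0].keys()))
```

Printing the column names first is a quick way to confirm which fields your version of the generator wrote before building anything on top of them.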
