A PyTorch system for open-domain question answering that retrieves and reads documents, originally applied to Wikipedia.
DrQA is a PyTorch-based open-domain question answering system that reads documents to answer factoid questions. It combines a document retriever to find relevant texts from a large corpus (like Wikipedia) with a neural document reader to extract precise answers. The system addresses the challenge of machine reading at scale, where answers must be located within potentially millions of unstructured documents.
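The two-stage retrieve-then-read design can be sketched as follows. The helper names and the word-overlap scoring here are illustrative stand-ins, not DrQA's actual API: the real system uses a TF-IDF retriever and a neural reader.

```python
import re

def words(text):
    """Lowercased word set for simple overlap scoring."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, corpus, k=2):
    # Stand-in for the TF-IDF document retriever: rank documents by word overlap.
    return sorted(corpus, key=lambda doc: -len(words(question) & words(doc)))[:k]

def read(question, documents):
    # Stand-in for the neural reader: pick the sentence with the most overlap.
    sentences = [s for doc in documents for s in doc.split(". ")]
    return max(sentences, key=lambda s: len(words(question) & words(s)))

corpus = [
    "Paris is the capital of France. It lies on the Seine",
    "Berlin is the capital of Germany",
]
docs = retrieve("What is the capital of France?", corpus, k=1)
answer = read("What is the capital of France?", docs)
print(answer)
```

The key property this illustrates is the separation of concerns: the retriever narrows millions of documents to a handful, and the reader only ever sees that handful.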
Researchers and developers working on natural language processing, information retrieval, and question answering systems, particularly those interested in scalable, open-domain QA applications.
DrQA provides a proven, modular framework for open-domain QA with pre-trained models and efficient retrieval, offering a strong baseline for research and practical deployments. Its separation of retrieval and reading components allows flexibility and adaptation to different document collections.
Reading Wikipedia to Answer Open-Domain Questions
Uses TF-IDF over hashed n-grams to retrieve relevant documents quickly from a corpus of millions, achieving high precision (e.g., 78.0 P@5 on SQuAD), as shown in the pre-trained models table.
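The idea behind hashed n-gram TF-IDF is feature hashing: map each unigram and bigram to a fixed bin so the vocabulary never needs to be stored. A minimal sketch, with the caveat that DrQA's implementation differs in detail (it uses murmur3 hashing into 2^24 bins; md5 and a small bin count are used here purely for illustration):

```python
import hashlib
import math
from collections import Counter

N_BINS = 2 ** 16  # illustrative; far smaller than a production setting

def hashed_ngrams(text, n=2):
    """Hash unigrams and bigrams into fixed bins (feature hashing)."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return [int(hashlib.md5(g.encode()).hexdigest(), 16) % N_BINS for g in grams]

def tfidf_vector(text, df, n_docs):
    """Sparse TF-IDF vector keyed by hash bin (smoothed idf)."""
    counts = Counter(hashed_ngrams(text))
    return {b: c * math.log((n_docs + 1) / (df.get(b, 0) + 1))
            for b, c in counts.items()}

def score(q_vec, d_vec):
    """Dot product between sparse query and document vectors."""
    return sum(w * d_vec.get(b, 0.0) for b, w in q_vec.items())

docs = ["the cat sat on the mat", "dogs chase cats in the park"]
df = Counter(b for d in docs for b in set(hashed_ngrams(d)))
vecs = [tfidf_vector(d, df, len(docs)) for d in docs]
q = tfidf_vector("where did the cat sit", df, len(docs))
best = max(range(len(docs)), key=lambda i: score(q, vecs[i]))
print(docs[best])
```

Because bin indices replace vocabulary lookups, retrieval reduces to sparse dot products, which is what makes ranking millions of documents fast.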
A multi-layer RNN reader trained on SQuAD delivers competitive extractive QA: the single model scores 78.9 F1 on the dev set, enabling precise answer extraction from retrieved text.
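Extractive readers of this kind score candidate start and end positions over the passage and decode the highest-scoring span subject to a maximum length. A minimal sketch of that decoding step, with made-up logits standing in for the model's outputs:

```python
import math

def best_span(start_logits, end_logits, max_len=5):
    """Return (i, j) maximizing start_logits[i] + end_logits[j]
    subject to i <= j < i + max_len, as in extractive span decoding."""
    best, best_score = (0, 0), -math.inf
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best_score:
                best_score, best = s + end_logits[j], (i, j)
    return best

# Toy passage with hand-picked logits peaking on the answer token.
tokens = ["Paris", "is", "the", "capital", "of", "France"]
start = [2.0, -1.0, -1.0, 0.5, -1.0, 0.1]
end = [1.5, -1.0, -1.0, -0.5, -1.0, 0.2]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))
```

The length cap keeps decoding linear in practice and prevents degenerate whole-passage answers.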
Separates retriever and reader components, allowing easy swapping of tokenizers (CoreNLP, spaCy) and adaptation to different document collections beyond Wikipedia.
Ships models trained on SQuAD and with distant supervision, ready for deployment on Wikipedia or similar corpora without additional training; all are available through the download scripts.
Requires Java for the CoreNLP tokenizer plus multiple Python packages, making installation complex and error-prone; the tokenizer section documents the required environment-variable configuration.
The TF-IDF model must be rebuilt whenever the document collection changes, which is inefficient for dynamic corpora; the README notes the system is designed for static dumps such as the 2016 Wikipedia snapshot.
Targets PyTorch 1.0, which lacks the features and optimizations of newer releases, potentially affecting compatibility and long-term maintenance.