Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.

DrQA

License: NOASSERTION · Language: Python

A PyTorch system for open-domain question answering by retrieving and reading documents, originally applied to Wikipedia.

4.5k stars · 885 forks · 0 contributors

What is DrQA?

DrQA is a PyTorch-based open-domain question answering system that reads documents to answer factoid questions. It combines a document retriever to find relevant texts from a large corpus (like Wikipedia) with a neural document reader to extract precise answers. The system addresses the challenge of machine reading at scale, where answers must be located within potentially millions of unstructured documents.

Target Audience

Researchers and developers working on natural language processing, information retrieval, and question answering systems, particularly those interested in scalable, open-domain QA applications.

Value Proposition

DrQA provides a proven, modular framework for open-domain QA with pre-trained models and efficient retrieval, offering a strong baseline for research and practical deployments. Its separation of retrieval and reading components allows flexibility and adaptation to different document collections.

Overview

Reading Wikipedia to Answer Open-Domain Questions

Use Cases

Best For

  • Building scalable question answering systems over large document collections
  • Research on machine reading and open-domain QA benchmarks
  • Extracting factual answers from Wikipedia or similar corpora
  • Educational projects demonstrating retrieval-based QA pipelines
  • Prototyping QA applications with pre-trained models
  • Experimenting with distant supervision for NLP training

Not Ideal For

  • Applications requiring real-time or low-latency question answering due to computational overhead from retrieval and neural inference
  • Projects needing answers that involve multi-hop reasoning, generation, or are not directly extractable as text spans
  • Environments with limited computational resources or strict memory constraints, given the large models and database requirements
  • Use cases with rapidly changing document collections where frequent retriever updates are impractical

Pros & Cons

Pros

Efficient Scalable Retrieval

Uses TF-IDF over hashed n-grams to retrieve relevant documents quickly from a collection of millions. The pre-trained models table reports 78.0 P@5 on SQuAD, where P@5 is the percentage of questions whose answer appears in one of the top 5 retrieved documents.
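The idea of TF-IDF over hashed n-grams can be sketched in a few lines. This is a minimal illustration, not DrQA's implementation: the bin count, the `zlib.crc32` hash, the whitespace tokenizer, the smoothed IDF formula, and the function names `hashed_ngrams`, `build_index`, and `rank` are all assumptions chosen to keep the example self-contained.

```python
import math
import zlib
from collections import Counter

NUM_BINS = 2 ** 20  # assumed toy bin count, for illustration only


def hashed_ngrams(text, max_n=2):
    """Map every unigram..max_n-gram of `text` to a hash-bin id."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    grams = [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]
    # crc32 is a deterministic stand-in hash, not the hash DrQA uses
    return [zlib.crc32(g.encode("utf-8")) % NUM_BINS for g in grams]


def build_index(docs):
    """Return per-document bin counts and a smoothed IDF weight per bin."""
    doc_tfs = [Counter(hashed_ngrams(doc)) for doc in docs]
    doc_freq = Counter(bin_id for tf in doc_tfs for bin_id in tf)
    n_docs = len(docs)
    idf = {b: math.log((1 + n_docs) / (1 + df)) + 1.0
           for b, df in doc_freq.items()}
    return doc_tfs, idf


def rank(query, doc_tfs, idf, k=5):
    """Return the ids of the k documents with the highest TF-IDF dot product."""
    q_tf = Counter(hashed_ngrams(query))
    scored = []
    for doc_id, tf in enumerate(doc_tfs):
        score = sum(
            math.log1p(q_count) * math.log1p(tf[b]) * idf.get(b, 0.0) ** 2
            for b, q_count in q_tf.items()
        )
        scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [doc_id for _, doc_id in scored[:k]]


docs = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "The Nile is a river in Africa",
]
doc_tfs, idf = build_index(docs)
print(rank("capital of France", doc_tfs, idf, k=2))
```

Hashing n-grams into a fixed number of bins keeps the index size bounded regardless of vocabulary, at the cost of occasional collisions between unrelated n-grams.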

Accurate Neural Comprehension

A multi-layer RNN reader trained on SQuAD performs competitive extractive QA: the single model scores 78.9 F1 on the SQuAD dev set and returns answers as text spans taken from the retrieved documents.
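Extractive readers typically produce a start score and an end score for each token and then pick the best-scoring span. The sketch below shows one common way to do that selection; it is an assumption about the general technique, not code from DrQA, and the function name `best_span`, the `max_len` default, and the example scores are all hypothetical.

```python
def best_span(start_scores, end_scores, max_len=15):
    """Return (start, end, score) maximizing start_scores[s] * end_scores[e]
    subject to s <= e < s + max_len. Indices are token positions."""
    if len(start_scores) != len(end_scores):
        raise ValueError("start and end scores must have the same length")
    best = (0, 0, float("-inf"))
    for s, p_start in enumerate(start_scores):
        # Only consider ends at or after the start, within max_len tokens
        for e in range(s, min(s + max_len, len(end_scores))):
            score = p_start * end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best


tokens = "the capital of France is Paris".split()
# Made-up per-token probabilities standing in for a neural reader's output
start_scores = [0.05, 0.05, 0.05, 0.10, 0.05, 0.70]
end_scores = [0.02, 0.03, 0.05, 0.10, 0.10, 0.70]

start, end, score = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))
```

The length cap keeps the search from pairing a confident start with a confident end that belongs to a different, distant phrase.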

Modular Flexible Design

Separates retriever and reader components, allowing easy swapping of tokenizers (CoreNLP, spaCy) and adaptation to different document collections beyond Wikipedia.
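The benefit of separating retrieval from reading can be illustrated with a small sketch in which both stages are plain callables that can be swapped independently. Everything here is hypothetical and unrelated to DrQA's actual classes: the `answer` function, the `make_overlap_retriever` factory, and the deliberately naive `last_token_reader` are stand-ins that only demonstrate the interface.

```python
import re


def words(text):
    """Lowercased word tokens with punctuation stripped."""
    return re.findall(r"\w+", text.lower())


def make_overlap_retriever(corpus):
    """Build a toy retriever that ranks documents by shared words."""
    def retrieve(question, k):
        q = set(words(question))
        ranked = sorted(corpus, key=lambda doc: -len(q & set(words(doc))))
        return ranked[:k]
    return retrieve


def last_token_reader(question, doc):
    """Toy reader: answer with the document's last word, scored by overlap."""
    q = set(words(question))
    tokens = re.findall(r"\w+", doc)
    overlap = sum(t.lower() in q for t in tokens)
    return tokens[-1], overlap / len(tokens)


def answer(question, retriever, reader, k=5):
    """Run any retriever, then any reader, and keep the best-scoring span."""
    candidates = [reader(question, doc) for doc in retriever(question, k)]
    if not candidates:
        return None
    span, _ = max(candidates, key=lambda pair: pair[1])
    return span


corpus = [
    "The capital of France is Paris.",
    "The capital of Germany is Berlin.",
    "The Nile flows through Egypt.",
]
retriever = make_overlap_retriever(corpus)
print(answer("What is the capital of France?", retriever, last_token_reader, k=2))
```

Because `answer` depends only on the two call signatures, a different corpus, tokenizer, or reader model can be substituted without touching the other stage.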

Pre-trained for Immediate Use

Includes models trained on SQuAD and with distant supervision, ready for deployment on Wikipedia or similar corpora without additional training, as highlighted in the download scripts.
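Distant supervision, in the question answering setting, generally means creating training examples from question-answer pairs that have no annotated paragraph, by keeping retrieved paragraphs that happen to contain the answer string. The sketch below shows that general idea under simple assumptions (case-insensitive exact string matching, first occurrence only); the function name `distant_examples` and the record fields are hypothetical and not taken from DrQA.

```python
def distant_examples(question, answer, paragraphs):
    """Turn a (question, answer) pair into span-labeled training examples
    by keeping paragraphs that contain the answer string."""
    examples = []
    needle = answer.lower()
    for para in paragraphs:
        # Case-insensitive match on the first occurrence (simplifying assumption)
        start = para.lower().find(needle)
        if start == -1:
            continue  # paragraph gives no supervision signal
        examples.append({
            "question": question,
            "context": para,
            "answer_start": start,               # character offset, inclusive
            "answer_end": start + len(answer),   # character offset, exclusive
        })
    return examples


paragraphs = [
    "Paris is the capital of France.",
    "France borders Spain.",
    "Many tourists visit paris.",
]
for ex in distant_examples("What is the capital of France?", "Paris", paragraphs):
    print(ex["context"][ex["answer_start"]:ex["answer_end"]], "<-", ex["context"])
```

Labels produced this way are noisy: a paragraph can contain the answer string without actually answering the question, which is why such data is usually combined with filtering heuristics or with cleanly annotated data.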

Cons

Heavy Dependency Setup

Requires Java for the CoreNLP tokenizer plus multiple Python packages, which makes installation complex and error-prone; the tokenizer section of the README describes the environment variable configuration involved.

Static Retrieval Limitations

The TF-IDF model must be rebuilt when new documents are added, which is inefficient for dynamic corpora; the README describes a design aimed at static dumps such as the 2016 Wikipedia snapshot.

Outdated Framework Version

Built on PyTorch 1.0, which may lack features and optimizations of newer versions, potentially affecting compatibility and long-term maintenance.

Quick Stats

Stars: 4,469
Forks: 885
Contributors: 0
Open Issues: 55
Last commit: 2 years ago
Created: 2017

Tags

#information-retrieval #neural-networks #question-answering #natural-language-processing #wikipedia #reading-comprehension #pytorch

Built With

  • SQLite
  • Stanford CoreNLP
  • spaCy
  • Python
  • PyTorch

Included in

Machine Learning · 72.2k

Related Projects

HuggingFace Transformers

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Stars: 161,404
Forks: 33,441
Last commit: 1 day ago

jieba

Jieba Chinese word segmentation

Stars: 35,004
Forks: 6,697
Last commit: 1 year ago

spacy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Stars: 33,637
Forks: 4,686
Last commit: 20 days ago

Haystack

Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems.

Stars: 25,487
Forks: 2,832
Last commit: 3 days ago