A PyTorch system for open-domain question answering that retrieves and reads documents, originally applied to Wikipedia.
DrQA is a PyTorch-based open-domain question answering system that reads documents to answer factoid questions. It combines a document retriever to find relevant texts from a large corpus (like Wikipedia) with a neural document reader to extract precise answers. The system addresses the challenge of machine reading at scale, where answers must be located within potentially millions of unstructured documents.
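The two-stage retrieve-then-read design can be sketched as follows. The helper names and the word-overlap scoring here are illustrative stand-ins, not DrQA's actual API: the real system uses a TF-IDF retriever and a neural reader.

```python
import re

def words(text):
    """Lowercased word set for simple overlap scoring."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, corpus, k=2):
    # Stand-in for the TF-IDF document retriever: rank documents by word overlap.
    return sorted(corpus, key=lambda doc: -len(words(question) & words(doc)))[:k]

def read(question, documents):
    # Stand-in for the neural reader: pick the sentence with the most overlap.
    sentences = [s for doc in documents for s in doc.split(". ")]
    return max(sentences, key=lambda s: len(words(question) & words(s)))

corpus = [
    "Paris is the capital of France. It lies on the Seine",
    "Berlin is the capital of Germany",
]
docs = retrieve("What is the capital of France?", corpus, k=1)
answer = read("What is the capital of France?", docs)
print(answer)
```

The key property this illustrates is the separation of concerns: the retriever narrows millions of documents to a handful, and the reader only ever sees that handful.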
Researchers and developers working on natural language processing, information retrieval, and question answering systems, particularly those interested in scalable, open-domain QA applications.
DrQA provides a proven, modular framework for open-domain QA with pre-trained models and efficient retrieval, offering a strong baseline for research and practical deployments. Its separation of retrieval and reading components allows flexibility and adaptation to different document collections.
Reading Wikipedia to Answer Open-Domain Questions
Uses TF-IDF over hashed n-grams to retrieve relevant documents quickly from a corpus of millions, achieving high precision (e.g., 78.0 P@5 on SQuAD), as shown in the pre-trained models table.
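The idea behind hashed n-gram TF-IDF is feature hashing: map each unigram and bigram to a fixed bin so the vocabulary never needs to be stored. A minimal sketch, with the caveat that DrQA's implementation differs in detail (it uses murmur3 hashing into 2^24 bins; md5 and a small bin count are used here purely for illustration):

```python
import hashlib
import math
from collections import Counter

N_BINS = 2 ** 16  # illustrative; far smaller than a production setting

def hashed_ngrams(text, n=2):
    """Hash unigrams and bigrams into fixed bins (feature hashing)."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return [int(hashlib.md5(g.encode()).hexdigest(), 16) % N_BINS for g in grams]

def tfidf_vector(text, df, n_docs):
    """Sparse TF-IDF vector keyed by hash bin (smoothed idf)."""
    counts = Counter(hashed_ngrams(text))
    return {b: c * math.log((n_docs + 1) / (df.get(b, 0) + 1))
            for b, c in counts.items()}

def score(q_vec, d_vec):
    """Dot product between sparse query and document vectors."""
    return sum(w * d_vec.get(b, 0.0) for b, w in q_vec.items())

docs = ["the cat sat on the mat", "dogs chase cats in the park"]
df = Counter(b for d in docs for b in set(hashed_ngrams(d)))
vecs = [tfidf_vector(d, df, len(docs)) for d in docs]
q = tfidf_vector("where did the cat sit", df, len(docs))
best = max(range(len(docs)), key=lambda i: score(q, vecs[i]))
print(docs[best])
```

Because bin indices replace vocabulary lookups, retrieval reduces to sparse dot products, which is what makes ranking millions of documents fast.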
A multi-layer RNN reader trained on SQuAD delivers competitive extractive QA: the single model scores 78.9 F1 on the dev set, enabling precise answer extraction from retrieved text.
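Extractive readers of this kind score candidate start and end positions over the passage and decode the highest-scoring span subject to a maximum length. A minimal sketch of that decoding step, with made-up logits standing in for the model's outputs:

```python
import math

def best_span(start_logits, end_logits, max_len=5):
    """Return (i, j) maximizing start_logits[i] + end_logits[j]
    subject to i <= j < i + max_len, as in extractive span decoding."""
    best, best_score = (0, 0), -math.inf
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best_score:
                best_score, best = s + end_logits[j], (i, j)
    return best

# Toy passage with hand-picked logits peaking on the answer token.
tokens = ["Paris", "is", "the", "capital", "of", "France"]
start = [2.0, -1.0, -1.0, 0.5, -1.0, 0.1]
end = [1.5, -1.0, -1.0, -0.5, -1.0, 0.2]
i, j = best_span(start, end)
print(" ".join(tokens[i:j + 1]))
```

The length cap keeps decoding linear in practice and prevents degenerate whole-passage answers.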
Separates retriever and reader components, allowing easy swapping of tokenizers (CoreNLP, spaCy) and adaptation to different document collections beyond Wikipedia.
Ships models trained on SQuAD and with distant supervision, ready for deployment on Wikipedia or similar corpora without additional training; all are available through the download scripts.
Requires Java for the CoreNLP tokenizer plus multiple Python packages, making installation complex and error-prone; the tokenizer section documents the required environment-variable configuration.
The TF-IDF model must be rebuilt whenever the document collection changes, which is inefficient for dynamic corpora; the README notes the system is designed for static dumps such as the 2016 Wikipedia snapshot.
Targets PyTorch 1.0, which lacks the features and optimizations of newer releases, potentially affecting compatibility and long-term maintenance.