A corpus of academic papers about COVID-19 and related coronavirus research for text mining and NLP.
CORD-19 is a large-scale dataset of academic papers about COVID-19 and other coronaviruses, curated to support text mining and natural language processing research. It provides structured metadata, full-text parses, and precomputed document embeddings to help researchers analyze scientific literature efficiently. The dataset was updated regularly during the pandemic, up to its final release in June 2022.
Researchers and data scientists working in NLP, biomedical informatics, or computational linguistics who need access to structured COVID-19 literature for analysis, modeling, or tool development.
It offers a clean, machine-readable corpus with preprocessed embeddings and versioned releases, saving researchers time on data collection and cleaning. Its association with Semantic Scholar ensures quality curation and integration with broader academic resources.
Contains over 1 million papers, nearly 370,000 of them with full text, providing an extensive corpus for training NLP models on biomedical text.
Includes precomputed SPECTER document embeddings for semantic search and analysis, saving time on feature extraction.
Offers metadata in CSV and full-text parses in JSON, which makes programmatic integration and analysis straightforward, as the project's example Python script shows.
Provides weekly releases with detailed changelogs, enabling reproducible research and trend analysis over time.
Known to contain parsing errors, noise, and duplicate entries (e.g., multiple cord_uids for the same paper), so additional cleaning is required for reliable use.
The final release was in June 2022, so it misses later research and is not suitable for studies that need current literature.
Schema changes over versions and inconsistencies in metadata (e.g., duplicate rows) complicate data handling, as noted in the FAQs.
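The CSV metadata, JSON full-text parses, and duplicate rows mentioned above can be handled with a few lines of standard-library Python. The following is a minimal sketch, not the project's own script. It assumes a release whose metadata file has the columns `cord_uid`, `pdf_json_files`, and `pmc_json_files`, with multiple parse paths separated by semicolons, and whose JSON parses store paragraphs under a `body_text` list. Because the schema changed between versions, check these names against the release you download. The helper names (`load_metadata`, `full_text_paths`, `read_body_text`) are hypothetical, and keeping the first row per `cord_uid` and preferring PMC parses are illustrative choices only.

```python
import csv
import json


def load_metadata(metadata_path):
    """Read the metadata CSV, keeping the first row seen for each cord_uid.

    Assumption: the file has a `cord_uid` column. Keeping the first
    duplicate is an arbitrary choice made for this sketch.
    """
    papers = {}
    with open(metadata_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            papers.setdefault(row["cord_uid"], row)
    return papers


def full_text_paths(row):
    """Return the JSON parse paths listed for one metadata row.

    Assumption: paths live in `pmc_json_files` / `pdf_json_files` and are
    separated by semicolons. PMC parses are tried first here.
    """
    raw = row.get("pmc_json_files") or row.get("pdf_json_files") or ""
    return [p.strip() for p in raw.split(";") if p.strip()]


def read_body_text(json_path):
    """Return the paragraph texts from one full-text JSON parse.

    Assumption: paragraphs are objects with a `text` key under `body_text`.
    """
    with open(json_path, encoding="utf-8") as f:
        parse = json.load(f)
    return [para["text"] for para in parse.get("body_text", [])]
```

A typical use would call `load_metadata` on the release's metadata file, loop over the resulting rows, and pass each path from `full_text_paths` (joined to the release directory) to `read_body_text`. Rows with no full-text parse simply yield an empty list of paths.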