Question 1

How do I download and start using CORD-19 in Python?

Accepted Answer

Download the dataset from the S3 link in the README, extract files like metadata.csv, and use the provided Python example to load data and access full-text JSONs for analysis.
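As a rough illustration of that workflow, here is a minimal sketch using only the standard library. It assumes the archive has been extracted into one directory, that metadata.csv has `cord_uid`, `title`, `pdf_json_files` and `pmc_json_files` columns, that multiple parse paths in one cell are separated by `; `, and that each parse stores its paragraphs under `body_text`. The helper name `iter_full_text` is made up for this example; check the README for the exact layout of your release.

```python
import csv
import json
from pathlib import Path


def iter_full_text(root):
    """Yield (cord_uid, title, paragraphs) for rows that have a full-text parse."""
    root = Path(root)
    with open(root / "metadata.csv", newline="", encoding="utf-8") as f_in:
        for row in csv.DictReader(f_in):
            # Assumption: prefer the PMC parse, fall back to the PDF parse.
            paths = row.get("pmc_json_files") or row.get("pdf_json_files") or ""
            for rel_path in filter(None, (p.strip() for p in paths.split(";"))):
                json_path = root / rel_path
                if not json_path.exists():
                    continue
                with open(json_path, encoding="utf-8") as f_json:
                    parse = json.load(f_json)
                # Assumption: body_text is a list of {"text": ...} paragraphs.
                paragraphs = [p["text"] for p in parse.get("body_text", [])]
                yield row["cord_uid"], row["title"], paragraphs
                break  # one parse per metadata row is enough for this sketch
```

Rows without any full-text parse are skipped, so the generator only yields papers you can actually read paragraph by paragraph.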

Question 2

CORD-19 or LitCovid: which is better for COVID-19 text mining?

Accepted Answer

CORD-19 offers more structure with full-text parses and embeddings, but LitCovid is maintained by NLM and may be more current; choose based on your need for preprocessed data versus ongoing updates.

Question 3

What are the SPECTER embeddings in CORD-19 used for?

Accepted Answer

They are 768-dimensional document embeddings precomputed using the SPECTER model, ideal for semantic similarity, clustering, and search tasks without needing to train your own embeddings.
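To make the similarity use case concrete, here is a small sketch of nearest-neighbour lookup by cosine similarity. It assumes the embeddings file is a headerless CSV whose rows are a `cord_uid` followed by the vector values; the function names are invented for this example, and for the full dataset you would likely want NumPy or a vector index rather than pure Python.

```python
import csv
import math


def load_embeddings(path):
    """Read rows of the assumed form cord_uid, v1, ..., vN into a dict."""
    vectors = {}
    with open(path, newline="", encoding="utf-8") as f_in:
        for row in csv.reader(f_in):
            if row:
                vectors[row[0]] = [float(x) for x in row[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_uid, vectors, k=5):
    """Return the k (cord_uid, score) pairs closest to query_uid, best first."""
    query = vectors[query_uid]
    scored = [
        (uid, cosine(query, vec))
        for uid, vec in vectors.items()
        if uid != query_uid
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Calling `most_similar(some_uid, load_embeddings(path))` then gives a ranked list of related papers without training anything.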

Question 4

How to handle duplicate papers in CORD-19 metadata?

Accepted Answer

The README notes duplicate cord_uids exist due to multiple sources; implement custom deduplication logic, such as merging rows based on source priority, as no built-in solution is provided.
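One possible shape for such custom logic is sketched below. The priority order in `SOURCE_PRIORITY` is purely hypothetical, as is the assumption that the `source_x` column lists sources separated by `;`. Rows are plain dicts, for example as produced by `csv.DictReader`. The best-ranked row for each `cord_uid` is kept, and its empty fields are filled from lower-ranked duplicates.

```python
# Hypothetical priority order: earlier entries win. Adjust to your needs.
SOURCE_PRIORITY = ["PMC", "Medline", "Elsevier", "WHO", "ArXiv", "BioRxiv", "MedRxiv"]


def source_rank(row):
    """Rank a row by its best-known source; unknown sources sort last."""
    sources = [s.strip() for s in row.get("source_x", "").split(";")]
    ranks = [SOURCE_PRIORITY.index(s) for s in sources if s in SOURCE_PRIORITY]
    return min(ranks, default=len(SOURCE_PRIORITY))


def deduplicate(rows):
    """Merge rows that share a cord_uid, preferring higher-priority sources."""
    merged = {}
    for row in sorted(rows, key=source_rank):
        # The first row seen for a cord_uid is the best-ranked one.
        kept = merged.setdefault(row["cord_uid"], dict(row))
        for field, value in row.items():
            # Only fill gaps; never overwrite a value from a better source.
            if not kept.get(field) and value:
                kept[field] = value
    return list(merged.values())
```

Whether filling gaps from lower-priority rows is appropriate depends on your analysis; dropping the lower-ranked duplicates outright is a simpler alternative.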

Question 5

Is CORD-19 still being updated in 2024?

Accepted Answer

No, the final release was on June 2, 2022, as stated in the README, so it's a historical dataset for research up to that date only.

Question 6

How can I extract tables and figures from CORD-19 papers?

Accepted Answer

Tables are partially parsed in the full-text JSON files under 'ref_entries', but figure images are not included; you may need external tools or the original PDFs for complete extraction.
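For the part that is available, a sketch like the following pulls table entries out of one loaded parse. It assumes each `ref_entries` value is a dict with a `type` field (`"table"` for tables) and a `text` caption, and that some releases also carry an `html` rendering of the table; treat those key names as assumptions and inspect a sample file first.

```python
import json


def extract_tables(parse):
    """Collect table entries from one full-text parse (a dict loaded from JSON)."""
    tables = []
    for ref_id, entry in parse.get("ref_entries", {}).items():
        # Assumption: table entries are marked with type == "table".
        if entry.get("type") == "table":
            tables.append({
                "ref_id": ref_id,
                "caption": entry.get("text", ""),
                # Assumption: an "html" key may hold the parsed table body.
                "html": entry.get("html"),
            })
    return tables


def extract_tables_from_file(path):
    """Load one parse from disk and return its table entries."""
    with open(path, encoding="utf-8") as f_json:
        return extract_tables(json.load(f_json))
```

Entries with `html` set to `None` have only a caption, which is where falling back to the original PDF becomes necessary.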
