How do I download datasets from nlp-datasets?

Browse the alphabetical list in the README, follow each dataset's link to its external host (such as AWS or Kaggle), and use that host's download instructions. nlp-datasets does not host any files itself.
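
Because the list is a markdown file, its links can also be collected programmatically before you visit each host. The following is a minimal sketch, assuming entries use standard markdown inline links of the form `[name](url)`. The sample text and the `extract_links` helper are hypothetical illustrations, not content or tooling from the repository.

```python
import re

# Matches markdown inline links of the form [label](http...).
LINK_PATTERN = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")


def extract_links(markdown_text):
    """Return (label, url) pairs for every inline link in the text."""
    return LINK_PATTERN.findall(markdown_text)


# Hypothetical sample in the assumed format, not real README content.
sample = (
    "- [Example Corpus](https://example.com/corpus): sample description (10 MB)\n"
    "- [Another Set](https://example.org/data.zip): more text (2 GB)\n"
)

for label, url in extract_links(sample):
    print(f"{label} -> {url}")
```

Each extracted URL still has to be opened on its own host, since download steps, licenses, and access agreements differ per source.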

nlp-datasets vs Hugging Face Datasets: which should I use?

nlp-datasets is a curated list of free, raw datasets meant for discovery, while Hugging Face Datasets offers pre-processed, ready-to-use datasets with APIs. Choose nlp-datasets to find public domain sources; choose Hugging Face Datasets for integrated, model-ready data.

Are the datasets in nlp-datasets labeled for machine learning?

Most are raw, unstructured text, but some like CrowdFlower sentiment analysis datasets include labels. For fully annotated corpora, the project points to other sources, so check each dataset's description carefully.

What languages are covered besides English?

Besides English, the list includes datasets in German, Arabic, Albanian, Urdu, Kinyarwanda, and Kirundi, with specific entries such as Albanian news articles and German court decisions.

How up-to-date is the nlp-datasets list?

The repository appears to be a static collection with no regular update schedule; it relies on community contributions, so some links might be outdated or datasets may have changed since listing.

Can I contribute to or update nlp-datasets?

Yes. Because it is an open-source GitHub project, you can submit pull requests to add or update datasets. There is no automated process, so changes depend on maintainer approval and manual curation.

Open-Awesome

nlp-datasets

An alphabetical list of free and public domain text datasets for Natural Language Processing (NLP) tasks.

GitHub: 6.0k stars · 992 forks · 0 contributors

What is nlp-datasets?

nlp-datasets is a curated, alphabetical directory of free and public domain text datasets specifically for Natural Language Processing. It aggregates links and descriptions for hundreds of datasets, from news corpora and social media archives to academic papers and product reviews, which addresses the problem of fragmented and hard-to-discover data sources for NLP practitioners.

Target Audience

NLP researchers, data scientists, machine learning engineers, and students who need raw text data for training models, conducting linguistic analysis, or benchmarking algorithms.

Value Proposition

Developers choose this resource because it saves significant time in dataset discovery and vetting by providing a centralized, well-organized list with practical metadata (size, source, description) across multiple languages and domains, all focused on being freely usable.

Overview

Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)

Use Cases

Best For

Finding large-scale text corpora for training language models
Sourcing multilingual data for cross-lingual NLP projects
Locating domain-specific datasets (e.g., legal, medical, news) for specialized model training
Discovering social media text (Twitter, Reddit) for sentiment analysis research
Accessing historical text data for linguistic or sociological studies
Comparing and selecting datasets based on size and format before downloading
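
Comparing datasets by size before downloading can be sketched as below. This is an illustration under stated assumptions: sizes are written as a number followed by KB, MB, GB, or TB, and the units are treated as decimal, which the list does not confirm. The `parse_size` helper is hypothetical. The SMS Spam and Common Crawl sizes are the ones quoted on this page.

```python
import re

# Decimal multipliers; whether the list means decimal or binary units is an assumption.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

SIZE_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB)\s*", re.IGNORECASE)


def parse_size(text):
    """Convert a string like '200 KB' to a byte count, or None if unparseable."""
    match = SIZE_PATTERN.fullmatch(text)
    if match is None:
        return None
    return float(match.group(1)) * UNITS[match.group(2).upper()]


# Sizes as quoted on this page.
listed = {"SMS Spam": "200 KB", "Common Crawl": "541 TB"}

# Keep only the datasets that fit within a 1 GB budget, smallest first.
budget = 10**9
fits = sorted(
    (name for name, size in listed.items() if parse_size(size) <= budget),
    key=lambda name: parse_size(listed[name]),
)
print(fits)
```

Entries whose size is missing or written in another form return `None` from `parse_size` and would need to be filtered out before the comparison.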

Not Ideal For

Projects requiring pre-annotated datasets for supervised learning tasks like NER or sentiment analysis
Teams needing real-time data access or APIs for dynamic NLP applications
Researchers looking for integrated data preprocessing or cleaning tools
Developers who prefer a searchable, interactive database over a static markdown list

Pros & Cons

Pros

Extensive Dataset Curation

Provides direct links and metadata for hundreds of datasets, from small collections like SMS Spam (200 KB) to massive crawls like Common Crawl (541 TB), eliminating the need to scour multiple sources.

Multi-Language Support

Includes datasets in English, German, Arabic, Albanian, Urdu, Kinyarwanda, and Kirundi, such as German political speeches and Kinyarwanda news articles, catering to multilingual NLP research.

Diverse Text Domains

Covers a wide range including news, social media, reviews, and academic papers, with examples like Amazon reviews, Reddit comments, and Enron emails, supporting varied model training.

Clear Source Attribution

References original providers like AWS, Kaggle, and CrowdFlower, and includes a dedicated sources section, ensuring transparency and further exploration.

Cons

Lacks Annotated Data Focus

Primarily lists raw, unstructured text; the README explicitly states that for annotated corpora or Treebanks, users should refer to other sources, limiting use for supervised learning.

Static and Manual Access

No API or search functionality; users must manually browse and download from external links, and some datasets require agreements or are on request, adding friction.

Potential Outdatedness

Because it is a static GitHub repository without regular updates, listed datasets may move, change, or disappear over time, leaving broken links or obsolete data. Some entries point to sources that are several years old.
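
One way to gauge how many links have gone stale is to request each URL and flag the failures. The sketch below uses only the Python standard library. The `http_status` and `find_broken` helpers are hypothetical names, and the `fetch` parameter exists so the logic can be tried without network access.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def find_broken(urls, fetch=http_status):
    """Return the URLs whose status is missing or is 400 and above."""
    broken = []
    for url in urls:
        status = fetch(url)
        if status is None or status >= 400:
            broken.append(url)
    return broken


# Offline demonstration with a stubbed fetch function and placeholder URLs.
stub = {"https://example.com/ok": 200, "https://example.com/gone": 404}
print(find_broken(list(stub), fetch=stub.get))
```

Some hosts reject HEAD requests or require a login, so a flagged URL is a prompt for a manual check rather than proof that the dataset is gone.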

Related Projects

NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Stars: 22,973 · Forks: 3,609 · Last commit: 1 year ago

Awesome NLP

A curated list of resources dedicated to Natural Language Processing (NLP)

Stars: 18,431 · Forks: 2,783 · Last commit: 18 days ago

awesome-chinese-nlp

A curated list of resources for Chinese NLP 中文自然语言处理相关资料

Stars: 7,934 · Forks: 1,709 · Last commit: 2 years ago