Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.


nlp-datasets

An alphabetical list of free and public domain text datasets for Natural Language Processing (NLP) tasks.

GitHub: 6.0k stars · 992 forks · 0 contributors

What is nlp-datasets?

nlp-datasets is a curated, alphabetical directory of free and public domain text datasets for Natural Language Processing. It aggregates links and descriptions for hundreds of datasets, from news corpora and social media archives to academic papers and product reviews, so that NLP practitioners do not have to hunt across fragmented, hard-to-discover sources.

Target Audience

NLP researchers, data scientists, machine learning engineers, and students who need raw text data for training models, conducting linguistic analysis, or benchmarking algorithms.

Value Proposition

Developers choose this resource because it shortens dataset discovery and vetting: it is a single, well-organized list with practical metadata (size, source, description) that spans multiple languages and domains and is limited to freely usable data.

Overview

Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)

Use Cases

Best For

  • Finding large-scale text corpora for training language models
  • Sourcing multilingual data for cross-lingual NLP projects
  • Locating domain-specific datasets (e.g., legal, medical, news) for specialized model training
  • Discovering social media text (Twitter, Reddit) for sentiment analysis research
  • Accessing historical text data for linguistic or sociological studies
  • Comparing and selecting datasets based on size and format before downloading

Not Ideal For

  • Projects requiring pre-annotated datasets for supervised learning tasks like NER or sentiment analysis
  • Teams needing real-time data access or APIs for dynamic NLP applications
  • Researchers looking for integrated data preprocessing or cleaning tools
  • Developers who prefer a searchable, interactive database over a static markdown list

Pros & Cons

Pros

Extensive Dataset Curation

Provides direct links and metadata for hundreds of datasets, from small collections like SMS Spam (200 KB) to massive crawls like Common Crawl (541 TB), eliminating the need to scour multiple sources.
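
The size metadata makes it possible to shortlist datasets before downloading anything. A minimal sketch, assuming sizes are written as a number plus a unit, as in the two figures quoted above; the helper name `size_to_bytes`, the hand-built `candidates` mapping, and the choice of decimal (1000-based) units are assumptions for illustration, not something the list defines:

```python
import re

# Decimal multipliers; an assumption, since the list does not say
# whether its sizes use decimal (1000) or binary (1024) units.
UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

def size_to_bytes(text: str) -> int:
    """Convert a size string such as '200 KB' into a byte count."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KMGT]?B)\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * UNITS[unit.upper()])

# Hypothetical shortlist copied by hand from the list's entries.
candidates = {"Common Crawl": "541 TB", "SMS Spam": "200 KB"}

# Smallest first, so quick experiments can start with the cheapest download.
by_size = sorted(candidates, key=lambda name: size_to_bytes(candidates[name]))
```

Sorting on a numeric byte count avoids the trap of comparing size strings alphabetically, where "541 TB" would sort after "200 KB" only by coincidence.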

Multi-Language Support

Includes datasets in English, German, Arabic, Albanian, Urdu, Kinyarwanda, and Kirundi, such as German political speeches and Kinyarwanda news articles, catering to multilingual NLP research.

Diverse Text Domains

Covers a wide range including news, social media, reviews, and academic papers, with examples like Amazon reviews, Reddit comments, and Enron emails, supporting varied model training.

Clear Source Attribution

References original providers like AWS, Kaggle, and CrowdFlower, and includes a dedicated sources section, which keeps provenance transparent and makes further exploration easier.

Cons

Lacks Annotated Data Focus

Primarily lists raw, unstructured text; the README explicitly states that for annotated corpora or Treebanks, users should refer to other sources, limiting use for supervised learning.

Static and Manual Access

No API or search functionality; users must manually browse and download from external links, and some datasets require agreements or are on request, adding friction.
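
The missing search can be approximated locally. A minimal sketch, assuming the list has been saved as text and that its entries use ordinary Markdown links of the form `[name](url)`; the sample entries, the example.org URLs, and the function name `find_entries` are hypothetical:

```python
import re

# Matches inline Markdown links: [label](target)
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")

def find_entries(markdown: str, keyword: str) -> list[tuple[str, str]]:
    """Return (name, url) pairs from every line that mentions the keyword."""
    needle = keyword.lower()
    hits = []
    for line in markdown.splitlines():
        if needle in line.lower():
            hits.extend(LINK.findall(line))
    return hits

# Hypothetical excerpt; the real list's wording and URLs will differ.
sample = """\
- [Example Reviews](https://example.org/reviews): product reviews, 1 GB
- [Example News](https://example.org/news): news articles, 20 MB
"""

matches = find_entries(sample, "news")
```

This is a plain case-insensitive substring match per line, so it only finds what the entry text literally says. Datasets that require an agreement or are available on request still need the manual step.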

Potential Outdatedness

As a static GitHub repository whose last commit was 3 years ago, the list can fall behind its sources: datasets may move, change, or become unavailable, leaving broken links or obsolete data.


Quick Stats

Stars: 5,974
Forks: 992
Contributors: 0
Open Issues: 4
Last commit: 3 years ago
Created: 2016

Tags

#data-curation #multilingual #natural-language-processing #research #open-data #machine-learning

Included in

Linguistics (436)
Auto-fetched 1 day ago

Related Projects

NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Stars: 22,973
Forks: 3,609
Last commit: 1 year ago

Awesome NLP

A curated list of resources dedicated to Natural Language Processing (NLP)

Stars: 18,431
Forks: 2,783
Last commit: 18 days ago

awesome-chinese-nlp

A curated list of resources for Chinese NLP (中文自然语言处理相关资料)

Stars: 7,934
Forks: 1,709
Last commit: 2 years ago