An alphabetical list of free and public domain text datasets for Natural Language Processing (NLP) tasks.
nlp-datasets is a curated, alphabetical directory of free and public domain text datasets for Natural Language Processing. It collects links and descriptions for hundreds of datasets, from news corpora and social media archives to academic papers and product reviews, so that NLP practitioners do not have to hunt through fragmented, hard-to-discover sources.
It is aimed at NLP researchers, data scientists, machine learning engineers, and students who need raw text data for training models, conducting linguistic analysis, or benchmarking algorithms.
Developers choose this resource because it cuts the time spent discovering and vetting datasets: it offers a single, well-organized list with practical metadata (size, source, description) across multiple languages and domains, with a focus on freely usable data.
Provides direct links and metadata for hundreds of datasets, from small collections like SMS Spam (200 KB) to massive crawls like Common Crawl (541 TB), eliminating the need to scour multiple sources.
Includes datasets in English, German, Arabic, Albanian, Urdu, Kinyarwanda, and Kirundi, such as German political speeches and Kinyarwanda news articles, catering to multilingual NLP research.
Covers a wide range of domains, including news, social media, reviews, and academic papers, with examples such as Amazon reviews, Reddit comments, and Enron emails, which supports varied model training.
References original providers such as AWS, Kaggle, and CrowdFlower, and includes a dedicated sources section, which makes provenance transparent and supports further exploration.
Primarily lists raw, unstructured text; the README explicitly directs users to other sources for annotated corpora or treebanks, which limits its usefulness for supervised learning.
No API or search functionality; users must manually browse the list and download from external links, and some datasets require agreements or are available only on request, which adds friction.
As a static GitHub repository, it may not be updated regularly, so datasets can become unavailable or change over time, leaving broken links or obsolete data; some of the listed sources are already several years old.
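Because the list offers no search, one possible workaround is to filter a local copy of the README by keyword. The sketch below is only an illustration and is not part of the repository: it assumes entries are Markdown bullets of the form `- [Name](URL): description`, and the names `Entry`, `parse_entries`, and `search`, as well as the `example.com` URLs and descriptions in `SAMPLE`, are hypothetical.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    url: str
    description: str


# Assumed entry format (not confirmed by the source): "- [Name](URL): description"
ENTRY_RE = re.compile(
    r"^\s*[-*]\s*\[(?P<name>[^\]]+)\]\((?P<url>[^)\s]+)\)\s*:?\s*(?P<desc>.*)$"
)


def parse_entries(readme_text: str) -> list[Entry]:
    """Collect every line that looks like a dataset bullet; skip everything else."""
    entries = []
    for line in readme_text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            entries.append(
                Entry(match["name"], match["url"], match["desc"].strip())
            )
    return entries


def search(entries: list[Entry], keyword: str) -> list[Entry]:
    """Case-insensitive substring match on the name or the description."""
    needle = keyword.lower()
    return [
        e for e in entries
        if needle in e.name.lower() or needle in e.description.lower()
    ]


# Hypothetical sample text; the URLs are placeholders, not the real links.
SAMPLE = """\
## Datasets
- [SMS Spam Collection](https://example.com/sms-spam): labelled SMS messages (200 KB)
- [Common Crawl](https://example.com/common-crawl): web crawl corpus (541 TB)
Not an entry line.
"""

if __name__ == "__main__":
    for entry in search(parse_entries(SAMPLE), "crawl"):
        print(entry.name, entry.url)
```

In practice, `SAMPLE` would be replaced by the contents of a downloaded README; if the real entries use a different layout, `ENTRY_RE` would need adjusting.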