An alphabetical list of free and public domain text datasets for Natural Language Processing (NLP) tasks.
nlp-datasets is a curated, alphabetical directory of free and public domain text datasets specifically for Natural Language Processing. It aggregates links and descriptions for hundreds of datasets—from news corpora and social media archives to academic papers and product reviews—solving the problem of fragmented and hard-to-discover data sources for NLP practitioners.
NLP researchers, data scientists, machine learning engineers, and students who need raw text data for training models, conducting linguistic analysis, or benchmarking algorithms.
Developers choose this resource because it saves significant time in dataset discovery and vetting by providing a centralized, well-organized list with practical metadata (size, source, description) across multiple languages and domains, all focused on being freely usable.
Provides direct links and metadata for hundreds of datasets, from small collections like SMS Spam (200 KB) to massive crawls like Common Crawl (541 TB), eliminating the need to scour multiple sources.
Includes datasets in English, German, Arabic, Albanian, Urdu, Kinyarwanda, and Kirundi, such as German political speeches and Kinyarwanda news articles, catering to multilingual NLP research.
Covers a wide range including news, social media, reviews, and academic papers, with examples like Amazon reviews, Reddit comments, and Enron emails, supporting varied model training.
References original providers like AWS, Kaggle, and CrowdFlower, and includes a dedicated sources section, ensuring transparency and further exploration.
Primarily lists raw, unstructured text; the README explicitly states that users seeking annotated corpora or Treebanks should look elsewhere, which limits its usefulness for supervised learning.
No API or search functionality; users must manually browse and download from external links, and some datasets require signed agreements or are available only on request, adding friction.
As a static GitHub repository, datasets may become unavailable or change without notice, risking broken links or obsolete data; some of the listed sources date back several years.
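Because the list is a static README of external links, it is worth vetting links before relying on them. Below is a minimal sketch (the README filename and the use of HEAD requests are assumptions, not part of the original list) that extracts markdown links from a local copy and checks whether each URL still responds:

```python
import re
import urllib.request
import urllib.error

# Markdown link pattern: [label](http(s)://url)
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")

def extract_links(markdown_text):
    """Return (label, url) pairs for every markdown link in the text."""
    return LINK_RE.findall(markdown_text)

def check_link(url, timeout=10):
    """Return True if the URL answers a HEAD request with a non-error status.

    A False result may just mean the server rejects HEAD requests,
    so treat failures as candidates for manual review, not dead links.
    """
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, ValueError):
        return False

if __name__ == "__main__":
    # Assumes a local clone with the list in README.md (hypothetical path).
    with open("README.md", encoding="utf-8") as f:
        for label, url in extract_links(f.read()):
            status = "ok" if check_link(url) else "CHECK"
            print(f"{status:5} {label}: {url}")
```

Running this against a fresh clone gives a quick triage list of entries that need manual verification before use.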
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.
A curated list of resources dedicated to Natural Language Processing (NLP)
A curated list of resources for Chinese NLP (Chinese natural language processing materials)