Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.

Corpus

8 projects

nlp-datasets

An alphabetical list of free and public domain text datasets for Natural Language Processing (NLP) tasks.

#data-curation #nlp-resources #multilingual
Stars: 6.0k
Forks: 989
Last commit: 3 years ago

karthinkncode's Datasets for Natural Language Processing

A collaboratively maintained, reverse-chronological list of datasets and corpora for natural language processing tasks.

#ai-training-data #nlp-research #research-tools
Stars: 918
Forks: 249
Last commit: 6 years ago

quanteda (R)

An R package for the quantitative analysis of textual data, providing comprehensive tools for natural language processing and text management.

#computational-linguistics #parallel-computing #r-package
Stars: 884
Forks: 191
Last commit: 5 days ago

Seq2seq-Chatbot (Python)

A minimal 200-line implementation of a sequence-to-sequence chatbot using TensorLayer and TensorFlow.

#chat #educational #tensorlayer
Stars: 840
Forks: 309
Last commit: 4 years ago

FakeNewsCorpus

A dataset of millions of news articles labeled by credibility type for training fake news detection algorithms.

#database #data-scraping #text-corpus
Stars: 413
Forks: 98
Last commit: 6 years ago

CORD-19

A corpus of academic papers about COVID-19 and related coronavirus research for text mining and NLP.

#document-embeddings #semantic-scholar #natural-language-processing
Stars: 186
Forks: 23
Last commit: 1 year ago

colibri-core (C++)

A C++ and Python library for efficient extraction and analysis of n-grams, skipgrams, and flexgrams from large corpora.

#c-plus-plus-library #computational-linguistics #pattern-modeling
Stars: 130
Forks: 20
Last commit: 4 months ago

WebNLG (Python)

An enriched dataset for Natural Language Generation research, providing intermediate representations for pipeline tasks like lexicalization and aggregation.

#pipeline-architecture #nlp-research #data-to-text
Stars: 70
Forks: 22
Last commit: 5 years ago

Related Tags

#Natural Language Processing (6) · #Machine Learning (4) · #Nlp (3)