This dataset gathers 728,321 biographies from Wikipedia. It is intended for evaluating text generation algorithms.
The Schema-Guided Dialogue Dataset
This dataset provides structured NBA basketball game data paired with human-written summaries, enabling research in data-to-document generation. It serves as a benchmark for training and evaluating models that convert structured statistics into coherent natural language narratives.

## Key Features

- **Aligned Summaries and Statistics**: Each human-written game summary is paired with corresponding box-scores and line-scores.
- **Dual Source Coverage**: Includes data from Rotowire (2014–2017) and SBNation (2006–2017) with distinct writing styles.
- **Structured JSON Format**: Data is provided in a consistent JSON schema with team, player, and game details.
- **Preprocessed for NLP**: Summaries are tokenized and cleaned, with numeric values standardized as integers.
- **Standard Splits**: Data is divided into training, validation, and test sets for machine learning experiments.

## Philosophy

The dataset is designed to support reproducible research in natural language generation, focusing on the challenge of transforming structured sports data into fluent, informative text.
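The pairing of structured statistics with a tokenized summary can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual loader: the record, the team and player names, and every field name (`home_line`, `vis_line`, `box_score`, `summary`, `TEAM-NAME`, `TEAM-PTS`, `PLAYER_NAME`, `PTS`) are assumptions made for the example and should be checked against the real JSON files.

```python
import json

# Hypothetical record: all names, values, and keys below are illustrative
# assumptions, not the dataset's documented schema.
SAMPLE = json.loads("""
{
  "home_line": {"TEAM-NAME": "Alphas", "TEAM-PTS": 105},
  "vis_line": {"TEAM-NAME": "Betas", "TEAM-PTS": 98},
  "box_score": [
    {"PLAYER_NAME": "A. Example", "TEAM": "Alphas", "PTS": 27},
    {"PLAYER_NAME": "B. Sample", "TEAM": "Betas", "PTS": 19}
  ],
  "summary": ["The", "Alphas", "beat", "the", "Betas", "105", "-", "98", "."]
}
""")


def linearize(record):
    """Flatten line-scores and box-scores into (entity, field, value) triples."""
    triples = []
    # Team-level statistics (line-scores).
    for side in ("home_line", "vis_line"):
        line = record[side]
        name = line["TEAM-NAME"]
        for field, value in line.items():
            if field != "TEAM-NAME":
                triples.append((name, field, value))
    # Player-level statistics (box-score rows).
    for row in record["box_score"]:
        name = row["PLAYER_NAME"]
        for field, value in row.items():
            if field != "PLAYER_NAME":
                triples.append((name, field, value))
    return triples


def numbers_in_summary(record):
    """Return the integer tokens that appear in the tokenized summary."""
    return [int(tok) for tok in record["summary"] if tok.isdigit()]


triples = linearize(SAMPLE)
mentioned = numbers_in_summary(SAMPLE)
# Numbers in the summary that also occur as a value in the structured data.
supported = [n for n in mentioned if any(v == n for _, _, v in triples)]
```

Flattening the record into triples is one common way to feed tabular data to a sequence model, and comparing summary numbers against those triples gives a crude check of how well a text is grounded in the statistics. Other encodings are equally plausible.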
Computer-generated weather forecasts from weather.gov (US public forecast), along with corresponding weather data
YelpNLG provides resources for natural language generation of restaurant reviews.