This dataset provides structured NBA basketball game data paired with human-written summaries, enabling research in data-to-document generation. It serves as a benchmark for training and evaluating models that convert structured statistics into coherent natural language narratives.

## Key Features

- **Aligned Summaries and Statistics**: Each human-written game summary is paired with its corresponding box-scores and line-scores.
- **Dual Source Coverage**: Includes data from Rotowire (2014–2017) and SBNation (2006–2017), which have distinct writing styles.
- **Structured JSON Format**: Data is provided in a consistent JSON schema with team, player, and game details.
- **Preprocessed for NLP**: Summaries are tokenized and cleaned, with numeric values standardized as integers.
- **Standard Splits**: Data is divided into training, validation, and test sets for machine learning experiments.

## Philosophy

The dataset is designed to support reproducible research in natural language generation, focusing on the challenge of transforming structured sports data into fluent, informative text.
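
To illustrate how a structured JSON record might line up with its tokenized summary, here is a minimal Python sketch. The schema is not documented in this description, so every field name (`home_team`, `away_team`, `players`, `points`, `summary`) and every value in the sample record is a hypothetical placeholder rather than the dataset's actual layout. The helper `mentioned_scores` is likewise invented for illustration: it checks which team point totals appear as tokens in the summary, a simple form of the statistic-to-text alignment the dataset is meant to support.

```python
import json

# Hypothetical record: field names and values are illustrative assumptions,
# not the dataset's documented schema or real game results.
RAW = """
{
  "home_team": {"name": "Home", "points": 102},
  "away_team": {"name": "Away", "points": 95},
  "players": [
    {"name": "Player One", "team": "Home", "points": 20},
    {"name": "Player Two", "team": "Away", "points": 18}
  ],
  "summary": ["Home", "beat", "Away", "102", "-", "95", "."]
}
"""


def mentioned_scores(record):
    """Return names of teams whose integer point total appears in the tokenized summary."""
    tokens = set(record["summary"])
    teams = (record["home_team"], record["away_team"])
    # Statistics are assumed to be integers, so convert before comparing to string tokens.
    return [team["name"] for team in teams if str(team["points"]) in tokens]


record = json.loads(RAW)
print(mentioned_scores(record))
```

Because the summary is assumed to be already tokenized, the check is a plain set lookup. The real files may store summaries or statistics differently, so the field access would need adjusting to the actual schema.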

Stars: 115

Forks: 25

Last commit: 4 years ago

WeatherGov

Computer-generated weather forecasts from weather.gov (US public forecast), along with corresponding weather data

YelpNLG provides resources for natural language generation of restaurant reviews
