A dataset of NBA game summaries aligned with box- and line-scores for data-to-text generation research.
Boxscore Data is a research dataset that pairs NBA basketball game summaries with corresponding box-scores and line-scores. It was created to support data-to-document generation tasks, where models learn to produce human-like narratives from structured statistics. The dataset includes games from 2006 to 2017, sourced from Rotowire and SBNation, and is formatted in JSON for easy integration into machine learning pipelines.
The dataset is aimed at researchers and students in natural language processing, particularly those working on data-to-text generation, summarization, or sports analytics. It is also suitable for educators creating assignments on structured data processing.
This dataset provides a clean, aligned corpus of sports statistics and narratives, which is rare and valuable for training generative models. Its standardized format and preprocessed content reduce the overhead of data cleaning, allowing researchers to focus on model development and evaluation.
This dataset provides structured NBA basketball game data paired with human-written summaries, enabling research in data-to-document generation. It serves as a benchmark for training and evaluating models that convert structured statistics into coherent natural language narratives.
The dataset is designed to support reproducible research in natural language generation, focusing on the challenge of transforming structured sports data into fluent, informative text.
Each human-written game summary is directly paired with corresponding box-scores and line-scores, enabling straightforward training for data-to-text generation models as described in the README.
Includes data from Rotowire (2014–2017) and SBNation (2006–2017) with distinct writing styles, allowing researchers to compare narrative approaches and improve model robustness.
Summaries are tokenized with NLTK, numbers are standardized to integers, and irrelevant content such as tweets is removed, which reduces preprocessing overhead for machine learning pipelines.
Data is provided in a consistent JSON schema with detailed team, player, and game objects, making it easy to parse and integrate into experimental setups.
Pre-divided into training, validation, and test sets for both Rotowire and SBNation data, facilitating reproducible research and benchmarking.
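As a rough illustration of how the paired JSON records might be consumed, the sketch below reads a list of game objects and yields (statistics, summary) pairs. The field names `summary`, `box_score`, `home_line`, and `vis_line` are assumptions made for this example, and the sample record is invented; consult the dataset's README for the actual schema.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple

# Assumed field names (hypothetical); verify against the dataset's README.
SUMMARY_KEY = "summary"
STATS_KEYS = ("box_score", "home_line", "vis_line")


def load_split(path: str) -> List[Dict[str, Any]]:
    """Read one split file that contains a JSON array of game objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_pairs(
    games: List[Dict[str, Any]],
) -> Iterator[Tuple[Dict[str, Any], str]]:
    """Yield (structured_stats, summary_text) for each game record."""
    for game in games:
        stats = {k: game[k] for k in STATS_KEYS if k in game}
        summary = game.get(SUMMARY_KEY, "")
        # Summaries are described as pre-tokenized, so accept either a
        # list of tokens or a plain string.
        if isinstance(summary, list):
            summary = " ".join(summary)
        yield stats, summary


# Invented one-record sample so the sketch runs without the real files.
sample = json.loads(
    '[{"summary": ["The", "home", "team", "won", "."],'
    ' "home_line": {"TEAM-PTS": "102"},'
    ' "vis_line": {"TEAM-PTS": "98"},'
    ' "box_score": {}}]'
)
pairs = list(iter_pairs(sample))
```

With the real data, `load_split` would be pointed at one of the training, validation, or test files and its result passed to `iter_pairs`.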
The README recommends using SportSett:Basketball instead because the Rotowire data is contaminated: the same box- and line-scores appear in more than one split, which undermines data integrity for some experiments.
Covers NBA games only up to 2017, making it unsuitable for research requiring recent data or contemporary player and team performances.
Covers only NBA basketball, with no other sports or leagues, so applying it to broader data-to-text tasks requires significant adaptation.
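The split contamination noted above can be checked for directly. The sketch below is one possible approach, not the method used by the dataset authors: it fingerprints the statistics portion of each record and reports fingerprints that occur in more than one split. The field names and the tiny in-memory splits are assumptions made for illustration.

```python
import hashlib
import json
from collections import defaultdict
from typing import Any, Dict, List, Set

# Assumed field names (hypothetical); verify against the dataset's README.
STATS_KEYS = ("box_score", "home_line", "vis_line")


def stats_fingerprint(game: Dict[str, Any], keys=STATS_KEYS) -> str:
    """Hash only the structured statistics, ignoring the summary text."""
    payload = {k: game.get(k) for k in keys}
    # sort_keys gives a canonical serialization, so key order does not matter.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_cross_split_overlap(
    splits: Dict[str, List[Dict[str, Any]]],
) -> Dict[str, Set[str]]:
    """Map each fingerprint seen in two or more splits to those split names."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for name, games in splits.items():
        for game in games:
            seen[stats_fingerprint(game)].add(name)
    return {fp: names for fp, names in seen.items() if len(names) > 1}


# Invented records: game_a's statistics appear in both splits, paired
# with different summaries.
game_a = {"box_score": {"PTS": {"0": "20"}}, "home_line": {"TEAM-PTS": "102"},
          "vis_line": {"TEAM-PTS": "98"}, "summary": ["First", "write-up", "."]}
game_a_dup = dict(game_a, summary=["Second", "write-up", "."])
game_b = {"box_score": {"PTS": {"0": "31"}}, "home_line": {"TEAM-PTS": "110"},
          "vis_line": {"TEAM-PTS": "105"}, "summary": ["Another", "game", "."]}

overlap = find_cross_split_overlap(
    {"train": [game_a, game_b], "test": [game_a_dup]}
)
```

Hashing only the statistics means two records count as the same game even when their summaries differ, which matches the kind of overlap the limitation describes.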