Boxscore Data vs SportSett Basketball dataset: which should I use?

The boxscore-data authors recommend SportSett:Basketball for new work because it fixes contamination between the data splits of the original Rotowire dataset. Use Boxscore Data when you need to compare against earlier published results or need the SBNation subset; otherwise, prefer SportSett for its cleaner splits.

How to load and parse boxscore-data JSON files in Python?

Extract the tar.bz2 archives to obtain the JSON files, then load each one with Python's json module. Each file contains a list of game records, which json.load returns as dictionaries with fields such as summary and box_score, ready for analysis or model training.
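A minimal sketch of that loading step follows, using only the standard library. The function names are hypothetical, and the second function reads the JSON members straight from the archive instead of unpacking it to disk, which is a convenience variant and not something the dataset requires.

```python
import json
import tarfile


def load_games(json_path):
    """Load one extracted JSON file; the top level is expected to be a list."""
    with open(json_path, encoding="utf-8") as f:
        games = json.load(f)
    if not isinstance(games, list):
        raise ValueError(f"expected a list of games in {json_path}")
    return games


def load_games_from_archive(archive_path):
    """Read every .json member of a tar.bz2 archive without unpacking it."""
    splits = {}
    with tarfile.open(archive_path, "r:bz2") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".json"):
                with tar.extractfile(member) as f:
                    # json.load accepts the binary file object returned here.
                    splits[member.name] = json.load(f)
    return splits
```

Each returned game is a plain dictionary, so game["summary"] and game["box_score"] can be inspected directly once a file is loaded.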

Is boxscore-data suitable for real-time sports commentary generation?

No, because the data is historical and static, ending in 2017. For real-time applications, you'd need a live data feed and likely more recent datasets to capture current NBA trends and player stats.

What preprocessing steps are applied to the summaries in boxscore-data?

Summaries were tokenized with nltk, and hyphenated phrases were separated. For the SBNation summaries, tweets and photos were removed. Paragraphs containing fewer than two numbers were also excluded, to keep the text relevant for generation tasks.
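Two of these steps can be approximated with regular expressions, as sketched below. This is not the authors' preprocessing script: the function names are hypothetical, keeping the hyphen as its own token is an assumption, and only digit sequences are counted as numbers (spelled-out numbers such as "two" are ignored).

```python
import re

# Matches integers and simple decimals such as "31" or "45.5".
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def split_hyphens(text):
    """Separate hyphenated phrases: "third-quarter" becomes "third - quarter"."""
    return re.sub(r"(?<=\w)-(?=\w)", " - ", text)


def count_numbers(paragraph):
    """Count the digit-based numbers that appear in a paragraph."""
    return len(NUMBER_RE.findall(paragraph))


def filter_paragraphs(paragraphs, min_numbers=2):
    """Keep only the paragraphs that mention at least min_numbers numbers."""
    return [p for p in paragraphs if count_numbers(p) >= min_numbers]
```

For instance, filter_paragraphs drops a paragraph like "The crowd loved it." while keeping one that reports both points and rebounds.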

Can I use boxscore-data for player performance prediction models?

Only in a limited way. The dataset includes detailed player statistics such as points and rebounds, but it was designed for text generation, not prediction. Forecasting would likely require additional features or more recent data, although the dataset could serve as a baseline for historical analysis.

How does the Rotowire data differ from SBNation data in style?

Rotowire summaries are more formal and statistic-heavy, while SBNation summaries come from team-specific sites and are more varied and narrative-driven. Having both sources lets researchers train and evaluate models on two distinct writing styles.

Box-score data — NBA Game Summaries Dataset

What is Box-score data?

Boxscore Data is a research dataset that pairs NBA basketball game summaries with corresponding box-scores and line-scores. It was created to support data-to-document generation tasks, where models learn to produce human-like narratives from structured statistics. The dataset includes games from 2006 to 2017, sourced from Rotowire and SBNation, and is formatted in JSON for easy integration into machine learning pipelines.

Target Audience

Researchers and students in natural language processing, particularly those working on data-to-text generation, summarization, or sports analytics. It is also suitable for educators creating assignments on structured data processing.

Value Proposition

This dataset provides a clean, aligned corpus of sports statistics and narratives, which is rare and valuable for training generative models. Its standardized format and preprocessed content reduce the overhead of data cleaning, allowing researchers to focus on model development and evaluation.

Overview

This dataset provides structured NBA basketball game data paired with human-written summaries, enabling research in data-to-document generation. It serves as a benchmark for training and evaluating models that convert structured statistics into coherent natural language narratives.

Key Features

Aligned Summaries and Statistics — Each human-written game summary is paired with corresponding box-scores and line-scores.
Dual Source Coverage — Includes data from Rotowire (2014–2017) and SBNation (2006–2017) with distinct writing styles.
Structured JSON Format — Data is provided in a consistent JSON schema with team, player, and game details.
Preprocessed for NLP — Summaries are tokenized and cleaned, with numeric values standardized as integers.
Standard Splits — Data is divided into training, validation, and test sets for machine learning experiments.
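A record's box score can be explored with a small helper like the one sketched below. It assumes box_score is stored column-first, as a mapping from each statistic name to a mapping of row id to value. That layout, the column names PLAYER_NAME and PTS, and the "N/A" placeholder are assumptions for illustration and should be checked against a real file.

```python
def box_score_rows(box_score):
    """Pivot {column: {row_id: value}} into one dictionary per row."""
    rows = {}
    for column, cells in box_score.items():
        for row_id, value in cells.items():
            rows.setdefault(row_id, {})[column] = value
    # Rows come back in the order their ids first appear.
    return list(rows.values())


def as_int(value):
    """Convert a value to int when possible; otherwise return it unchanged."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return value


# Hypothetical record, shaped according to the assumptions above.
game = {
    "box_score": {
        "PLAYER_NAME": {"0": "Player A", "1": "Player B"},
        "PTS": {"0": "12", "1": "N/A"},
    }
}

for row in box_score_rows(game["box_score"]):
    print(row["PLAYER_NAME"], as_int(row["PTS"]))
```

Pivoting to one dictionary per player makes it easier to line up a statistic with the sentence in the summary that mentions it.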

Philosophy

The dataset is designed to support reproducible research in natural language generation, focusing on the challenge of transforming structured sports data into fluent, informative text.
