WebNLG

Python

An enriched dataset for Natural Language Generation research, providing intermediate representations for pipeline tasks like lexicalization and aggregation.

GitHub

71 stars · 22 forks · 0 contributors

Overview

The enriched version of the WebNLG dataset, described at INLG 2018.

5 years ago

Created: 2018

Included in: Natural Language Generation (480)

Related Projects

The Schema-Guided Dialogue Dataset

Stars: 606

Forks: 134

Last commit: 3 years ago

Box-score data

This dataset provides structured NBA basketball game data paired with human-written summaries, enabling research in data-to-document generation. It serves as a benchmark for training and evaluating models that convert structured statistics into coherent natural language narratives.

## Key Features

- **Aligned Summaries and Statistics:** Each human-written game summary is paired with corresponding box-scores and line-scores.
- **Dual Source Coverage:** Includes data from Rotowire (2014–2017) and SBNation (2006–2017) with distinct writing styles.
- **Structured JSON Format:** Data is provided in a consistent JSON schema with team, player, and game details.
- **Preprocessed for NLP:** Summaries are tokenized and cleaned, with numeric values standardized as integers.
- **Standard Splits:** Data is divided into training, validation, and test sets for machine learning experiments.

## Philosophy

The dataset is designed to support reproducible research in natural language generation, focusing on the challenge of transforming structured sports data into fluent, informative text.
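To illustrate how a tokenized summary can be checked against its paired statistics, here is a minimal sketch in Python. The record layout and every field name (`summary`, `home_line`, `visitor_line`, `box_score`, `points`) are assumptions made for illustration, as are the fictional teams and players; the dataset's actual schema may differ.

```python
import json

# Hypothetical record: field names and values are illustrative assumptions,
# not the dataset's actual schema or real game data.
SAMPLE = """
{
  "summary": ["The", "Alphas", "beat", "the", "Betas", "95", "-", "88", "."],
  "home_line": {"name": "Alphas", "points": 95},
  "visitor_line": {"name": "Betas", "points": 88},
  "box_score": [
    {"player": "Player One", "team": "Alphas", "points": 27},
    {"player": "Player Two", "team": "Betas", "points": 19}
  ]
}
"""


def numbers_in_summary(tokens):
    """Return the integer-valued tokens found in a tokenized summary."""
    return {int(tok) for tok in tokens if tok.isdigit()}


def supported_stats(record):
    """List (name, points) pairs from the structured data that the summary mentions."""
    mentioned = numbers_in_summary(record["summary"])
    # Team totals come from the line-scores, player totals from the box-score.
    stats = [(record[key]["name"], record[key]["points"])
             for key in ("home_line", "visitor_line")]
    stats += [(row["player"], row["points"]) for row in record["box_score"]]
    return [(name, value) for name, value in stats if value in mentioned]


record = json.loads(SAMPLE)
matches = supported_stats(record)
print(matches)
```

Matching on raw numbers is deliberately naive: a summary could mention 27 for a reason unrelated to any player's points, so a real alignment step would also need to match names and stat types.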

Stars: 115

Forks: 25

Last commit: 4 years ago

Alex Context NLG Dataset

A dataset for NLG that includes the preceding context with each generation instance.

Stars: 23

Forks: 12

Last commit: 9 years ago

Neural-Wikipedian

Neural-Wikipedian is a research project that adapts encoder-decoder neural network frameworks to automatically generate textual summaries (biographies) from structured Semantic Web triples. It addresses the challenge of transforming machine-readable knowledge base data into coherent, human-readable narratives, which is valuable for automating content creation and enhancing data accessibility.

## Key Features

- **Triple-to-Text Generation:** Converts sets of RDF triples (from DBpedia and Wikidata) into fluent English biography summaries.
- **Dual Dataset Support:** Includes aligned datasets of DBpedia and Wikidata triples paired with Wikipedia biographies for training and evaluation.
- **Neural Architectures:** Implements both Triples2LSTM and Triples2GRU models using the Torch framework for sequence generation.
- **Baseline Language Model:** Provides a KenLM n-gram language model as a comparative baseline for summary generation.
- **Pre-trained Models:** Offers downloadable pre-trained models for immediate inference without requiring full training cycles.

## Philosophy

The project approaches biography generation as a structured data-to-text translation problem, leveraging neural networks to learn the linguistic patterns and factual associations present in Wikipedia content.
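To make the triple-to-text framing concrete, the sketch below flattens a set of triples into a single token sequence of the kind an encoder could consume. It illustrates the general idea only and is not the project's own preprocessing; the marker tokens `<subj>`, `<pred>`, and `<obj>`, the `linearize` helper, and the example triples are all hypothetical.

```python
from typing import List, Tuple

# A triple is (subject, predicate, object), each given as a plain string.
Triple = Tuple[str, str, str]


def linearize(triples: List[Triple]) -> List[str]:
    """Flatten triples into one token list, marking each role with a tag.

    The tag names are assumptions chosen for this example.
    """
    tokens: List[str] = []
    for subj, pred, obj in triples:
        tokens += ["<subj>", *subj.split(),
                   "<pred>", *pred.split(),
                   "<obj>", *obj.split()]
    return tokens


# Fictional input triples, invented for illustration.
triples = [
    ("Jane Doe", "birth place", "Example City"),
    ("Jane Doe", "occupation", "writer"),
]
tokens = linearize(triples)
print(" ".join(tokens))
```

Keeping explicit role markers lets a sequence model tell subjects, predicates, and objects apart even when multi-word values make the boundaries ambiguous.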

Stars: 10

Forks: 1

Last commit: 8 years ago