A large-scale multi-domain dataset of over 20k annotated task-oriented dialogues for training and evaluating virtual assistants.
The Schema-Guided Dialogue (SGD) dataset is a large-scale collection of over 20,000 annotated, task-oriented conversations between humans and virtual assistants across 20 domains. It provides structured schemas defining service APIs and rich annotations to support the development and evaluation of dialogue systems for real-world applications like booking, information retrieval, and transactions. The dataset addresses the need for scalable, multi-domain benchmarks that reflect the complexity of actual virtual assistant interactions.
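To make the schema idea concrete, here is an abbreviated sketch of what one service schema entry looks like, following the field names of the published format; the concrete values are invented for illustration, not copied from the dataset:

```python
# Abbreviated sketch of one service schema entry, following the published
# SGD schema format; the values below are illustrative, not real data.
restaurant_schema = {
    "service_name": "Restaurants_1",
    "description": "A service for finding restaurants and reserving tables",
    "slots": [
        {
            "name": "city",
            "description": "City where the restaurant is located",
            "is_categorical": False,   # free-form slot, no fixed value set
            "possible_values": [],
        },
    ],
    "intents": [
        {
            "name": "ReserveRestaurant",
            "description": "Reserve a table at a restaurant",
            "is_transactional": True,  # performs a booking, not just a lookup
            "required_slots": ["restaurant_name", "city", "time"],
            "optional_slots": {"party_size": "2"},  # slot -> default value
            "result_slots": ["restaurant_name", "city", "time", "party_size"],
        },
    ],
}
```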
Researchers and engineers working on conversational AI, natural language understanding, and dialogue systems, particularly those focused on task-oriented virtual assistants, dialogue state tracking, and zero-shot generalization.
It offers scale and domain diversity beyond earlier public task-oriented corpora, with structured schema definitions enabling robust model training and evaluation. Unseen domains in the evaluation sets support zero-shot testing, and the companion SGD-X benchmark measures robustness to linguistic variation in schemas, making it a comprehensive tool for advancing real-world dialogue systems.
The Schema-Guided Dialogue Dataset
Covers 20 domains, several served by multiple services with overlapping APIs, providing a realistic testbed that mirrors the complexity virtual assistants face in production.
Annotates every turn with dialogue state, user/system dialogue acts, and service calls with their results, enabling research on natural language understanding (NLU), dialogue state tracking (DST), and policy learning; see the example turn after this list.
Holds out domains that appear only in the evaluation sets, enabling rigorous assessment of zero-shot generalization to unseen services, a key goal of the dataset design.
Ships with SGD-X, a companion benchmark of crowdsourced schema variants (five paraphrased versions of every schema) for measuring how well systems handle stylistic diversity in schema wording, a proxy for real-world robustness.
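As a concrete illustration of the per-turn annotations noted above, here is a hand-written sketch of a single user turn in the published dialogue format; the utterance and values are invented, and real turns carry one frame per active service:

```python
# Hand-written sketch of one annotated user turn, following the published
# SGD dialogue format; all values are illustrative.
user_turn = {
    "speaker": "USER",
    "utterance": "Book me a table in Palo Alto at 7 pm.",
    "frames": [
        {
            "service": "Restaurants_1",
            # Character-level slot spans within the utterance.
            "slots": [
                {"slot": "city", "start": 19, "exclusive_end": 28},
                {"slot": "time", "start": 32, "exclusive_end": 36},
            ],
            # Dialogue acts expressed by the user in this turn.
            "actions": [
                {"act": "INFORM", "slot": "city", "values": ["Palo Alto"]},
                {"act": "INFORM_INTENT", "slot": "intent",
                 "values": ["ReserveRestaurant"]},
            ],
            # Full dialogue state after this turn (the DST target).
            "state": {
                "active_intent": "ReserveRestaurant",
                "requested_slots": [],
                "slot_values": {"city": ["Palo Alto"], "time": ["7 pm"]},
            },
        },
    ],
}
```

System turns carry the same frame structure, with any service call and its results recorded in the frame instead of a dialogue state.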
The JSON representation of dialogues and schemas is deeply nested (dialogues contain turns, turns contain frames, and frames contain slots, actions, and state), so significant parsing and preprocessing effort stands between the raw data and quick experimentation; a minimal flattening sketch follows this list.
Dialogue outlines are generated by a simulator and then paraphrased by crowd workers, so conversations may not fully capture the spontaneity, disfluencies, and errors of natural human interactions, limiting realism for some applications.
While baseline code is linked, the dataset itself lacks built-in preprocessing or evaluation scripts, placing the burden on users to implement full pipelines from raw data to models.
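For the flattening sketch promised above: a minimal Python pass over one split, assuming the directory layout of the public GitHub release (train/dialogues_001.json and so on); the output record shape here is an illustrative choice, not an official format.

```python
"""Minimal sketch: flatten SGD dialogues into per-turn DST records.

Assumes the directory layout of the public GitHub release
(e.g. train/dialogues_001.json, train/dialogues_002.json, ...).
"""
import glob
import json


def flatten_split(split_dir):
    """Return one flat record per (user turn, frame) pair in a split."""
    records = []
    for path in sorted(glob.glob(f"{split_dir}/dialogues_*.json")):
        with open(path, encoding="utf-8") as f:
            dialogues = json.load(f)
        for dialogue in dialogues:
            for turn_idx, turn in enumerate(dialogue["turns"]):
                # Dialogue state annotations live on user turns.
                if turn["speaker"] != "USER":
                    continue
                for frame in turn["frames"]:
                    state = frame.get("state", {})
                    records.append({
                        "dialogue_id": dialogue["dialogue_id"],
                        "turn_idx": turn_idx,
                        "service": frame["service"],
                        "utterance": turn["utterance"],
                        "active_intent": state.get("active_intent", "NONE"),
                        "slot_values": state.get("slot_values", {}),
                    })
    return records


if __name__ == "__main__":
    records = flatten_split("train")
    print(f"{len(records)} user-turn frames")
    print(records[0])
```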