A command-line tool for holistic comparison and error analysis of language generation systems like machine translation and summarization.
compare-mt is a Python command-line tool for comparing the outputs of multiple language generation systems, such as machine translation, summarization, and dialog response generation. It analyzes text outputs against a reference to identify meaningful differences in performance, helping users understand what one system does better than another.
It is aimed at researchers and developers working on language generation systems, including machine translation, summarization, and dialog response generation, who need detailed error analysis and comparison beyond aggregate scores.
It provides a holistic, automated analysis with multiple granularities (word, sentence, n-gram) and supports various metrics (BLEU, ROUGE, COMET), making it easier to pinpoint specific strengths and weaknesses between systems.
A tool for holistic analysis of language generation systems
Provides word-level accuracy by frequency, sentence bucket analysis, and n-gram differences, moving beyond aggregate scores to pinpoint specific weaknesses, as demonstrated in the example comparing neural vs. phrase-based MT.
Supports BLEU, ROUGE, COMET, and word likelihoods, enabling cross-task evaluation like summarization with ROUGE scores, as shown in the summarization example.
Includes bootstrap resampling for metrics, with a configurable number of samples and probability threshold, to help assess whether a score difference between systems is reliable or just an artifact of the particular test sample.
Allows analysis by POS tags, source word features, and numerical labels, as evidenced by examples that reveal differences in content vs. function words or sentence position effects.
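To make the bootstrap resampling idea mentioned above concrete, here is a minimal sketch of a paired bootstrap comparison in plain Python. It is not compare-mt's implementation: the function name paired_bootstrap, its parameters, and the use of a simple mean over per-sentence scores (rather than a corpus-level metric such as BLEU) are assumptions made for illustration.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Estimate how often system A beats system B under resampling.

    scores_a / scores_b are hypothetical per-sentence scores (higher is
    better) for the same test sentences, in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")

    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = wins_b = ties = 0

    for _ in range(num_samples):
        # Draw n sentence indices with replacement; reuse the same indices
        # for both systems so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
        elif mean_b > mean_a:
            wins_b += 1
        else:
            ties += 1

    return {
        "win_a": wins_a / num_samples,
        "win_b": wins_b / num_samples,
        "tie": ties / num_samples,
    }
```

If one system wins in a large fraction of the resampled test sets (0.95 is a common choice), the difference is usually treated as unlikely to be due to the choice of test sentences alone.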
Requires aligned data, label files, or pre-processed counts for advanced analyses, adding significant overhead before use, as seen in the need for freq_corpus_file or ref_labels options.
Operates solely via the command line and produces static HTML output, with no GUI or API, which makes it harder to integrate into modern workflows and can hinder real-time collaboration.
Using the COMET metric requires a separate installation and, for reasonable speed, GPU access, creating a barrier for users without the hardware, as noted in the README.