A transfer learning-based evaluation metric for Natural Language Generation that scores text fluency and meaning.
BLEURT is a learned evaluation metric for Natural Language Generation that scores how well a candidate sentence matches a reference in terms of fluency and meaning. It applies transfer learning: pretrained models such as BERT and RemBERT are fine-tuned on human ratings data, which gives more accurate and robust assessments than traditional metrics like BLEU. It is designed for tasks that need automated quality evaluation of generated text.
BLEURT is aimed at researchers and developers working on NLG systems such as machine translation, text summarization, or dialogue generation who need reliable automated evaluation metrics. It is also useful for benchmarking model performance in academic or industrial settings.
BLEURT correlates more closely with human judgments than traditional metrics because it builds on deep learning and transfer learning. It can be fine-tuned and supports multiple languages, which makes it a versatile choice for diverse NLG evaluation scenarios.
BLEURT is a metric for Natural Language Generation based on transfer learning.
BLEURT is fine-tuned on human ratings data, which aligns it more closely with human judgment than traditional metrics like BLEU, in keeping with its stated philosophy of moving beyond string-matching approaches.
BLEURT has been tested on 13 languages and, through multilingual training on models like RemBERT, can in theory support over 100, which makes it suitable for global NLG applications, as noted in the Language Coverage section.
BLEURT offers command-line, Python, and TensorFlow APIs, so it can be integrated into a range of evaluation pipelines; the installation and usage examples give the details.
BLEURT includes batch size tuning, length-based batching, and distilled models for significant speed improvements; the examples in the 'Speeding Up BLEURT' section show up to 20x faster scoring.
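The Python API and the length-based batching idea mentioned above can be sketched together as follows. This is a minimal illustration, not BLEURT's own implementation: `load_bleurt_scorer` and `score_length_batched` are hypothetical helper names, character length is used as a rough stand-in for token length, and the only BLEURT call assumed is `score.BleurtScorer(checkpoint).score(references=..., candidates=...)`. Whether grouping by length actually saves time depends on the scorer padding each batch only to its longest member, which this sketch does not verify.

```python
from typing import List, Sequence


def load_bleurt_scorer(checkpoint: str):
    """Hypothetical helper: build a BLEURT scorer from a checkpoint path.

    The import is lazy so the batching logic below can be tried
    without BLEURT or TensorFlow installed.
    """
    from bleurt import score  # requires the bleurt package

    return score.BleurtScorer(checkpoint)


def score_length_batched(
    scorer,
    references: Sequence[str],
    candidates: Sequence[str],
    batch_size: int = 16,
) -> List[float]:
    """Score pairs in batches of similar length, returning scores in input order.

    `scorer` is any object exposing score(references=..., candidates=...)
    that returns one number per pair.
    """
    if len(references) != len(candidates):
        raise ValueError("references and candidates must have the same length")

    # Sort pair indices by combined character length (a proxy for token length).
    order = sorted(
        range(len(candidates)),
        key=lambda i: len(references[i]) + len(candidates[i]),
    )

    scores = [0.0] * len(candidates)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch_scores = scorer.score(
            references=[references[i] for i in idx],
            candidates=[candidates[i] for i in idx],
        )
        # Write each score back to the position of its original pair.
        for i, s in zip(idx, batch_scores):
            scores[i] = float(s)
    return scores
```

With BLEURT installed and a checkpoint downloaded, the call would look like `score_length_batched(load_bleurt_scorer("path/to/checkpoint"), references, candidates)`, where the checkpoint path is a placeholder.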
The README states that the default test checkpoint is based on BERT-Tiny and is 'very inaccurate,' so users must download a larger model such as BLEURT-20 for reliable results.
BLEURT relies heavily on TensorFlow and benefits from GPUs, which can be a barrier for setups with limited resources or for teams avoiding complex ML frameworks, as mentioned in the installation requirements.
Scores are described as noisy, necessitating averaging over corpora for robust evaluation, which might not be ideal for single-instance assessments or real-time feedback.
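The corpus-level averaging mentioned above can be sketched as below, assuming the sentence-level scores have already been computed. The function names are illustrative, and the percentile bootstrap interval is an addition of this sketch to show how much the average itself varies; it is not something the BLEURT documentation prescribes.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def corpus_score(scores: Sequence[float]) -> float:
    """Average sentence-level scores into a single corpus-level score."""
    if not scores:
        raise ValueError("need at least one sentence-level score")
    return fmean(scores)


def bootstrap_interval(
    scores: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the corpus-level mean."""
    if not scores:
        raise ValueError("need at least one sentence-level score")
    rng = random.Random(seed)
    n = len(scores)
    # Resample the corpus with replacement and record each resample's mean.
    means = sorted(
        fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

A wide interval suggests the corpus is too small for the average to separate two systems reliably.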