nlg-eval is a Python library for evaluating natural language generation models using multiple unsupervised automated metrics. It computes scores like BLEU, METEOR, ROUGE, and CIDEr by comparing generated text (hypotheses) against reference texts, helping researchers assess model performance quantitatively.
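A minimal sketch of the functional API, following the usage shown in the project README; the sentences below are illustrative placeholders, and the exact metric keys in the returned dict may vary by version:

```python
from nlgeval import compute_individual_metrics

hypothesis = "the cat sat on the mat"
references = [
    "a cat was sitting on the mat",
    "the cat is on the mat",
]

# Returns a dict of metric names to float scores for this one hypothesis,
# e.g. 'Bleu_1' through 'Bleu_4', 'METEOR', 'ROUGE_L', and 'CIDEr'.
metrics = compute_individual_metrics(references, hypothesis)
for name, score in metrics.items():
    print(f"{name}: {score:.4f}")
```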
It is aimed at researchers and developers working on natural language generation, machine translation, text summarization, or dialogue systems who need standardized evaluation metrics.
It consolidates multiple NLG metrics into a single, easy-to-use package with flexible APIs, reducing the need to implement each metric separately and ensuring consistent evaluation across projects.
Consolidates established metrics such as BLEU, METEOR, ROUGE, and CIDEr (the full list is in the README) into one package, reducing implementation effort for standardized NLG evaluation.
Offers command-line, functional, and object-oriented Python APIs, supporting both single examples and batch processing for diverse use cases; see the first sketch after this list.
Includes a setup script to automatically download required models and embeddings, simplifying initial configuration.
Allows custom directories for model storage via environment variables, facilitating shared or Dockerized deployments; the second sketch after this list covers this together with the setup step.
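The README documents all three entry points; the sketch below uses the object-oriented API so the heavy models are loaded once and reused across a batch, with the command-line form shown as a comment. The file names and the exact nesting of the batch reference lists reflect my reading of the README rather than a verified contract:

```python
# Command-line form, per the README (one --references flag per reference file):
#   nlg-eval --hypothesis=hyp.txt --references=ref1.txt --references=ref2.txt
from nlgeval import NLGEval

nlgeval = NLGEval()  # loads the models once; the first call is slow

hypotheses = [
    "the cat sat on the mat",
    "a dog barked at the mailman",
]
# One inner list per reference "stream": references[i][j] is the i-th
# reference for the j-th hypothesis.
references = [
    ["a cat was sitting on the mat", "the dog barked at a mailman"],
    ["the cat is on the mat", "a dog was barking at the mail carrier"],
]

# Batch scoring over the whole corpus at once.
corpus_scores = nlgeval.compute_metrics(references, hypotheses)

# Or score a single example against a flat list of its references.
single_scores = nlgeval.compute_individual_metrics(
    ["a cat was sitting on the mat", "the cat is on the mat"],
    "the cat sat on the mat",
)
print(corpus_scores, single_scores)
```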
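For the setup step and the storage option above, here is a sketch of pointing the library at a shared model directory, e.g. inside a Docker image. NLGEVAL_DATA is my assumption of the variable name the library reads, and the path is hypothetical; verify both against the README:

```python
import os

# Hypothetical shared path baked into a Docker image; the variable name
# NLGEVAL_DATA is an assumption, not a verified contract.
os.environ["NLGEVAL_DATA"] = "/opt/nlgeval-data"

# The one-time download of models and embeddings is a separate CLI step, e.g.:
#   nlg-eval --setup
# (recent versions; older releases shipped a setup script instead)

from nlgeval import NLGEval  # import after the variable is set

nlgeval = NLGEval()  # should now resolve models from the shared directory
```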
Requires a Java installation and large file downloads, with known issues on Windows and macOS High Sierra, complicating cross-platform deployment.
Relies on older models such as SkipThoughts and GloVe, which date from 2017 or earlier and may not reflect current best practices for text embeddings.
The maintainers acknowledge that scores can come out as zero on small datasets, and reliable results may require patches from external repositories.