
Robustness Metrics

Apache-2.0 · Jupyter Notebook

Lightweight Python library for evaluating classification model robustness across out-of-distribution generalization, stability, and uncertainty metrics.


What is Robustness Metrics?

Robustness Metrics is a Python library developed by Google Research that provides standardized modules for evaluating classification model robustness. It measures how well models perform across three critical dimensions: out-of-distribution generalization, stability under input perturbations, and uncertainty calibration. The library solves the problem of inconsistent robustness evaluation by offering lightweight, framework-agnostic tools that work with any model producing logits.

Target Audience

Machine learning researchers and practitioners who need to systematically evaluate and compare the robustness of classification models across different architectures and training regimes. This includes teams benchmarking model performance for research papers or production deployment.

Value Proposition

Developers choose Robustness Metrics because it provides standardized, reproducible evaluation across multiple robustness dimensions with minimal integration effort. Unlike custom evaluation scripts, it offers pre-built datasets, measurements, and reports while remaining framework-agnostic—working with TensorFlow, PyTorch, JAX, or any model that produces logits.

Overview

Robustness Metrics provides lightweight modules to evaluate the robustness of classification models across three key dimensions: out-of-distribution generalization, stability under natural perturbations, and uncertainty calibration. It includes popular benchmark datasets and works with any model that maps inputs to logits, making it applicable beyond vision models.
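
Of these three dimensions, uncertainty calibration is the easiest to make concrete: a well-calibrated model's confidence should match its empirical accuracy. The snippet below is a minimal NumPy sketch of expected calibration error (ECE), one common calibration measure; the function name, 15-bin scheme, and toy inputs are illustrative and are not taken from the library's own API.

```python
import numpy as np

def expected_calibration_error(logits, labels, num_bins=15):
    """Bin predictions by confidence and average |confidence - accuracy| per bin."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)          # softmax over classes
    confidences = probs.max(axis=1)                     # top-class probability
    correct = (probs.argmax(axis=1) == labels).astype(float)

    bin_edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Gap between average confidence and accuracy, weighted by bin mass.
            ece += in_bin.mean() * abs(confidences[in_bin].mean() - correct[in_bin].mean())
    return ece

# Toy check: four samples, three classes.
logits = np.array([[2.0, 0.1, 0.1],
                   [0.2, 1.5, 0.3],
                   [0.1, 0.1, 3.0],
                   [1.0, 1.1, 0.9]])
labels = np.array([0, 1, 2, 0])
print(round(expected_calibration_error(logits, labels), 3))
```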

Key Features

  • Out-of-distribution evaluation — Assesses model performance on shifted datasets like ImageNetV2 and ImageNet-C
  • Stability measurement — Evaluates prediction consistency under natural input perturbations
  • Uncertainty quantification — Measures how well predicted probabilities reflect true probabilities
  • Framework-agnostic — Works with TensorFlow, PyTorch, JAX, or any model producing logits
  • Pre-built datasets — Includes standardized out-of-distribution datasets for benchmarking
  • Flexible reporting — Supports custom measurement combinations or predefined robustness reports

Philosophy

The library emphasizes lightweight, modular evaluation that works across different model frameworks while providing standardized benchmarks for meaningful robustness comparisons.

Use Cases

Best For

  • Benchmarking model performance on out-of-distribution datasets like ImageNet-C
  • Evaluating prediction stability under natural input perturbations
  • Measuring uncertainty calibration in classification models
  • Comparing robustness across different model architectures
  • Reproducible robustness evaluation for research papers
  • Framework-agnostic model assessment (works with TensorFlow, PyTorch, JAX)

Not Ideal For

  • Teams needing only basic accuracy metrics on in-distribution data without robustness considerations
  • Projects involving non-classification models, such as regression models or generative adversarial networks
  • Environments where minimizing dependencies is critical, as it requires TensorFlow even for non-TF models
  • Real-time or low-latency evaluation scenarios due to batch processing and dataset loading overhead

Pros & Cons

Pros

Framework-Agnostic Design

Works with TensorFlow, PyTorch, JAX, or any model producing logits, as demonstrated by example model files in the repository like vit.py (JAX) and vgg.py (PyTorch).

Integrated Benchmark Datasets

Includes pre-built out-of-distribution datasets such as ImageNetV2 and ImageNet-C, making it easy to standardize evaluations without manual dataset handling.
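
These shifted benchmarks are typically served through tensorflow_datasets. The sketch below shows what loading one corrupted split looks like with tfds directly; the dataset and config names (imagenet2012_corrupted, gaussian_noise_1) are assumptions about how ImageNet-C is exposed in the tfds catalog rather than names documented on this page, and ImageNet itself still requires a manual download.

```python
# Hedged sketch: pull one ImageNet-C split via tensorflow_datasets.
# Dataset/config names are assumptions about the tfds catalog, not guarantees
# made by Robustness Metrics; ImageNet must be downloaded manually beforehand.
import tensorflow_datasets as tfds

dataset = tfds.load(
    "imagenet2012_corrupted/gaussian_noise_1",  # assumed config: Gaussian noise, severity 1
    split="validation",
)
for example in dataset.take(1):
    image, label = example["image"], example["label"]
    print(image.shape, label.numpy())
```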

Flexible Evaluation Modules

Allows specifying individual measurements via the --measurement flag or using a predefined report via the --report flag, enabling both custom and standardized robustness assessments.

Reproducible Metrics

Provides lightweight, modular tools that ensure consistent robustness comparisons across different models, as emphasized in the library's philosophy for meaningful benchmarking.

Cons

Unavoidable TensorFlow Dependency

Requires TensorFlow and TensorFlow Probability installation regardless of the model framework, adding unnecessary bloat for users working exclusively with PyTorch or JAX.

Manual Model Integration

Users must write a custom `create` function to interface their model, which is more cumbersome than plug-and-play libraries and can be error-prone for complex setups.
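
As a rough picture of what that integration step involves: the repository's example model files (such as vgg.py) each expose a create factory, and the evaluation harness only needs something it can call to turn inputs into logits. The sketch below assumes that shape of interface and uses a torchvision VGG purely as a stand-in; the exact signature the library expects is not spelled out on this page, so check the repository's examples before copying it.

```python
# Schematic of a custom `create` function for a PyTorch classifier.
# The return convention (a single callable from image batches to logits) is an
# assumption about the interface; see the repository's vgg.py for the real one.
import numpy as np
import torch
import torchvision

def create():
    """Return a callable mapping image batches to logits (assumed interface)."""
    net = torchvision.models.vgg16(weights="IMAGENET1K_V1")  # torchvision >= 0.13
    net.eval()
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

    @torch.no_grad()
    def model_fn(images: np.ndarray) -> np.ndarray:
        # Expects float32 images in [0, 1] with shape (batch, height, width, 3).
        x = torch.from_numpy(images).permute(0, 3, 1, 2)   # NHWC -> NCHW
        logits = net((x - mean) / std)                       # ImageNet normalization
        return logits.numpy()                                # shape (batch, 1000)

    return model_fn

model_fn = create()
print(model_fn(np.random.rand(1, 224, 224, 3).astype("float32")).shape)  # (1, 1000)
```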

Classification-Only Focus

The library is explicitly designed for classification models, limiting its applicability to other machine learning tasks like regression or object detection.

Sparse Documentation

The README has TODO notes and limited examples, which may hinder initial setup and understanding, especially for users unfamiliar with tensorflow_datasets integration.


Quick Stats

Stars: 473
Forks: 32
Contributors: 0
Open issues: 10
Last commit: 3 months ago
Created: 2020

Tags

#python-library #model-evaluation #uncertainty-quantification #benchmarking #machine-learning

Built With

TensorFlow
Python

Included in

Software Engineering for Machine Learning (1.3k)

Related Projects

PyTorch Lightning

Pretrain, finetune ANY AI model of ANY size on 1 or 10,000+ GPUs with zero code changes.

Stars: 31,142 · Forks: 3,722 · Last commit: 3 days ago

Label Studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format.

Stars: 27,330 · Forks: 3,530 · Last commit: 1 day ago

Great Expectations

Always know what to expect from your data.

Stars: 11,513 · Forks: 1,749 · Last commit: 3 days ago

Seldon Core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models.

Stars: 4,748 · Forks: 865 · Last commit: 1 month ago