Exposes NVIDIA GPU metrics for Prometheus monitoring using the NVIDIA Data Center GPU Manager (DCGM).
DCGM-Exporter is a Prometheus exporter that collects and exposes metrics from NVIDIA GPUs using the NVIDIA Data Center GPU Manager (DCGM). It solves the problem of monitoring GPU performance, health, and utilization in data center and cloud-native environments by providing standardized metrics that can be scraped by Prometheus and visualized in tools like Grafana.
It is aimed at system administrators, DevOps engineers, and data scientists who manage GPU-accelerated workloads in Kubernetes clusters, data centers, or High-Performance Computing (HPC) environments and need detailed GPU monitoring.
Developers choose DCGM-Exporter because it provides a production-ready, officially supported way to monitor NVIDIA GPUs with deep integration into the Prometheus ecosystem. Its configurable metric set, Kubernetes-native deployment, and HPC job mapping set it apart from basic monitoring tools that report only a fixed handful of GPU statistics.
NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Exposes over 100 DCGM fields including SM clock, memory clock, and temperature as Prometheus gauges, providing deep insights into GPU performance and health directly from the NVIDIA API.
Includes a Helm chart for installation and integrates with the NVIDIA GPU Operator, which makes it practical to run in cloud-native production environments with little manual configuration.
Allows users to specify which DCGM fields to collect via a custom CSV file, enabling tailored monitoring setups without code changes, as shown in the default-counters.csv example.
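As a rough illustration of the CSV-driven configuration described above, the sketch below parses a counters file in the three-column layout used by default-counters.csv (DCGM field name, Prometheus metric type, help text). The sample rows, the `Counter` type, and the `parse_counters` helper are illustrative assumptions and are not part of DCGM-Exporter itself.

```python
from typing import List, NamedTuple


class Counter(NamedTuple):
    """One row of a counters file (hypothetical helper type)."""
    field: str        # DCGM field name, e.g. DCGM_FI_DEV_SM_CLOCK
    metric_type: str  # Prometheus metric type, e.g. gauge
    help_text: str    # free-form help message


# Sample content in the style of default-counters.csv (rows chosen for illustration).
SAMPLE = """\
# DCGM FIELD, Prometheus metric type, help message
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

DCGM_FI_DEV_GPU_TEMP,  gauge, GPU temperature (in C).
"""


def parse_counters(text: str) -> List[Counter]:
    """Return the counters listed in `text`, skipping blank lines and '#' comments."""
    counters = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first two commas only, so the help text may contain commas.
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns, got: {raw!r}")
        counters.append(Counter(*parts))
    return counters


if __name__ == "__main__":
    for c in parse_counters(SAMPLE):
        print(f"{c.field} ({c.metric_type}): {c.help_text}")
```

Adding or removing a row in such a file changes which fields are collected, which is why no code change is needed to tailor the metric set.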
Can include HPC job IDs in metric labels by reading GPU-to-job mapping files from a directory, essential for tracking GPU usage in high-performance computing clusters with minimal setup.
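The job-mapping idea above can be sketched as follows. This is a minimal illustration under stated assumptions: each file in the mapping directory is assumed to be named after a GPU index and to list one job ID per line, and the `hpc_job` label name, the `read_job_mapping` and `with_job_labels` helpers, and the sample job IDs are all hypothetical. The exporter's actual file format should be checked against the official documentation.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def read_job_mapping(mapping_dir: Path) -> Dict[str, List[str]]:
    """Map each GPU index (file name) to the job IDs listed in that file."""
    mapping: Dict[str, List[str]] = {}
    for entry in sorted(mapping_dir.iterdir()):
        if not entry.is_file():
            continue
        jobs = [ln.strip() for ln in entry.read_text().splitlines() if ln.strip()]
        mapping[entry.name] = jobs
    return mapping


def with_job_labels(labels: Dict[str, str], jobs: List[str]) -> List[Dict[str, str]]:
    """Produce one label set per job by adding an (assumed) hpc_job label."""
    return [{**labels, "hpc_job": job} for job in jobs]


def demo() -> Dict[str, List[str]]:
    """Build a throwaway mapping directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "0").write_text("job-1001\n")            # GPU 0 runs one job
        (root / "1").write_text("job-1002\njob-1003\n")  # GPU 1 is shared by two jobs
        return read_job_mapping(root)


if __name__ == "__main__":
    mapping = demo()
    for gpu, jobs in mapping.items():
        print(gpu, with_job_labels({"gpu": gpu}, jobs))
```

In a real cluster the scheduler's prolog and epilog scripts would typically write and remove these files, so the exporter only has to read the directory.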
Relies entirely on NVIDIA DCGM, so it cannot monitor GPUs from other vendors like AMD or Intel, creating vendor lock-in and limiting use in mixed-hardware setups.
Requires DCGM to be installed and compatible with GPU drivers, adding setup complexity and potential versioning issues, especially in non-containerized environments.
Official documentation is hosted on docs.nvidia.com, separate from the GitHub repo, which can make it harder to find up-to-date information and contribute, as noted in the README.
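As shown in the quickstart example, some metrics such as memory temperature can report implausible values (e.g., 9223372036854775794). The figure sits just below the signed 64-bit maximum and appears to be a DCGM placeholder for a field the GPU does not support rather than a real reading, so dashboards and alerts need to filter it out.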
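One way to guard against the implausible value mentioned above is to drop it when post-processing scraped metrics. The sketch below assumes that any value at or above 0x7FFFFFFFFFFFFFF0 is a DCGM placeholder rather than a measurement. That threshold, the sample lines, and the `parse_value` and `parse_line` helpers are assumptions made for illustration, and the parser handles only simple exposition lines without timestamps.

```python
import re
from typing import Optional, Tuple

# Assumed threshold: values at or above this are treated as "no data" placeholders.
DCGM_INT64_BLANK = 0x7FFFFFFFFFFFFFF0

# Illustrative lines in Prometheus text exposition format (labels simplified).
SAMPLE_LINES = [
    'DCGM_FI_DEV_SM_CLOCK{gpu="0"} 139',
    'DCGM_FI_DEV_MEMORY_TEMP{gpu="0"} 9223372036854775794',
]

LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)


def parse_value(token: str) -> Optional[float]:
    """Return the sample value, or None if it looks like a placeholder."""
    try:
        value = int(token)    # keep full precision for large integers
    except ValueError:
        value = float(token)  # e.g. "0.5" or "9.223372036854776e+18"
    if value >= DCGM_INT64_BLANK:
        return None
    return float(value)


def parse_line(line: str) -> Optional[Tuple[str, Optional[float]]]:
    """Parse one exposition line into (metric name, value or None)."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # skip blanks and HELP/TYPE comments
    match = LINE_RE.match(line)
    if match is None:
        return None  # unsupported shape, e.g. a line with a timestamp
    return match.group("name"), parse_value(match.group("value"))


if __name__ == "__main__":
    for sample in SAMPLE_LINES:
        print(parse_line(sample))
```

The same filtering can often be expressed in the query or alerting layer instead; doing it in a script is just the simplest way to show the idea.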