Question 1

How do I install DCGM-Exporter on a bare-metal server without Kubernetes?

Accepted Answer

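You can run the container image with Docker using 'docker run --gpus all' (this requires the NVIDIA Container Toolkit on the host), or build from source and run 'make install' after installing DCGM. Either way, the NVIDIA driver and the DCGM version in use must be compatible with your GPUs, or metrics will not be collected correctly. By default the exporter serves metrics over HTTP on port 9400 at /metrics.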
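Once the exporter is running, a quick way to confirm it works is to read its metrics endpoint. The following is a minimal Python sketch, not part of DCGM-Exporter itself. It assumes the exporter listens on its default port 9400 on the same host and that samples carry no trailing timestamps; the function names are made up for illustration.

```python
import urllib.request

# Assumption: the exporter runs on this host with its default port (9400).
METRICS_URL = "http://localhost:9400/metrics"


def parse_metrics(text):
    """Return (name_with_labels, value) pairs from Prometheus text output.

    Assumes each sample line ends with the value (no trailing timestamp).
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Label values may contain spaces, so split on the LAST space only.
        name, _, value = line.rpartition(" ")
        samples.append((name, float(value)))
    return samples


def fetch_metrics(url=METRICS_URL, timeout=5):
    """Download the raw metrics text from the exporter."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling parse_metrics(fetch_metrics()) on the GPU host should return a non-empty list if the exporter can see the GPUs. An empty list or a connection error usually points to a driver, DCGM, or port-mapping problem.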

Question 2

What's the difference between DCGM-Exporter and the NVIDIA GPU Operator for monitoring?

Accepted Answer

DCGM-Exporter is a standalone Prometheus exporter focused on GPU metrics, while the GPU Operator manages the entire GPU software stack in Kubernetes, including the exporter. The README recommends the Operator for Kubernetes deployments because it simplifies management.

Question 3

How can I customize which GPU metrics are collected in DCGM-Exporter?

Accepted Answer

Create a custom CSV file listing the DCGM fields you want, following the format in etc/default-counters.csv, and pass it with the '-f' flag when running dcgm-exporter. Collecting fewer fields reduces overhead and lets you focus on specific metrics such as utilization or temperature.
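As an illustration, here is a small Python sketch that writes such a file. It assumes the row layout used in etc/default-counters.csv (DCGM field, Prometheus metric type, help text, with '#' starting a comment line). The output filename and helper names are hypothetical, and the help texts avoid commas to stay safe in a comma-separated file.

```python
# Each entry: (DCGM field, Prometheus metric type, help text).
# The field names are standard DCGM field identifiers; adjust to taste.
COUNTERS = [
    ("DCGM_FI_DEV_GPU_UTIL", "gauge", "GPU utilization (in %)."),
    ("DCGM_FI_DEV_GPU_TEMP", "gauge", "GPU temperature (in C)."),
    ("DCGM_FI_DEV_POWER_USAGE", "gauge", "Power draw (in W)."),
]


def render_counters(counters):
    """Render counters as text in the default-counters.csv row layout."""
    lines = ["# DCGM FIELD, Prometheus metric type, help message"]
    for field, metric_type, help_text in counters:
        lines.append(f"{field}, {metric_type}, {help_text}")
    return "\n".join(lines) + "\n"


def write_counters(path, counters=COUNTERS):
    """Write the rendered counters file to `path` (a hypothetical location)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(render_counters(counters))
```

After write_counters("custom-counters.csv"), the exporter would be started with 'dcgm-exporter -f custom-counters.csv'.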

Question 4

Does DCGM-Exporter work with AMD GPUs or only NVIDIA?

Accepted Answer

It works only with NVIDIA GPUs because it relies on NVIDIA's Data Center GPU Manager (DCGM). For AMD GPUs you need separate tooling, such as ROCm-based monitoring or a vendor-specific exporter.

Question 5

How do I secure DCGM-Exporter with TLS and authentication in production?

Accepted Answer

Use the '--web-config-file' flag with a YAML configuration file that defines TLS certificates and basic auth settings, as supported by the Prometheus exporter toolkit. A sample file is available in the exporter-toolkit repository for reference.
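For orientation, the Python sketch below renders a minimal web configuration using the exporter-toolkit keys 'tls_server_config' (with 'cert_file' and 'key_file') and 'basic_auth_users'. The paths, username, and helper name are hypothetical. The toolkit expects bcrypt hashes rather than plaintext passwords, so the sketch rejects values that do not look like a bcrypt hash; producing the hash itself is left to a separate tool.

```python
def render_web_config(cert_file, key_file, users):
    """Render an exporter-toolkit style web config as YAML text.

    `users` maps username -> bcrypt hash (NOT a plaintext password).
    """
    lines = [
        "tls_server_config:",
        f"  cert_file: {cert_file}",
        f"  key_file: {key_file}",
        "basic_auth_users:",
    ]
    for username, bcrypt_hash in sorted(users.items()):
        # bcrypt hashes start with "$2" ($2a$, $2b$, $2y$); anything else
        # is most likely a plaintext password passed in by mistake.
        if not bcrypt_hash.startswith("$2"):
            raise ValueError(f"value for {username!r} is not a bcrypt hash")
        # Single quotes keep the '$' characters literal in YAML.
        lines.append(f"  {username}: '{bcrypt_hash}'")
    return "\n".join(lines) + "\n"
```

The resulting text would be saved to a file and passed to the exporter with '--web-config-file'. Prometheus then needs matching 'scheme: https' and 'basic_auth' settings in its scrape configuration.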

Question 6

Which Grafana dashboard should I use with DCGM-Exporter for visualization?

Accepted Answer

The official NVIDIA DCGM-Exporter dashboard is on Grafana.com with ID 12239, and the JSON file is included in the repository under 'grafana/dcgm-exporter-dashboard.json'. It provides pre-built panels for key GPU metrics like clock speeds and temperatures.
