Monitor NVIDIA GPU Utilization on Ubuntu Using NVIDIA-SMI

Introduction

If you have an NVIDIA GPU (GeForce, RTX, Quadro, or Tesla), you can monitor it on an Ubuntu server with the command-line tool nvidia-smi. This guide covers what a beginner needs to know, from installation to real-time monitoring.

What is NVIDIA-SMI?

nvidia-smi (NVIDIA System Management Interface) is a command-line utility included with NVIDIA GPU drivers. It allows you to:

  • Monitor GPU usage and memory
  • Check temperature and power consumption
  • Track which processes are using the GPU
  • Manage GPU settings such as persistence mode

It ships with the NVIDIA driver, so there is no separate installation step once the official NVIDIA driver is installed.

Installing NVIDIA-SMI

1. Make sure your GPU driver is installed. Example for Ubuntu 24.04 with an RTX 4090:

sudo apt install -y nvidia-driver-535

2. Reboot if the driver was just installed, then verify that nvidia-smi is available:

nvidia-smi
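
If the shell reports that the command is not found, the driver package is most likely missing. The check can also be scripted; the following is a minimal sketch, where find_tool is a hypothetical helper name and driver_version is one of the fields accepted by --query-gpu.

```shell
#!/bin/sh
# find_tool NAME: print the path of NAME if it is on PATH, fail otherwise.
# (find_tool is an illustrative helper, not part of the NVIDIA tooling.)
find_tool() {
    command -v "$1" 2>/dev/null
}

if smi_path=$(find_tool nvidia-smi); then
    echo "nvidia-smi found at $smi_path"
    # Print only the driver version, one line per GPU.
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
else
    echo "nvidia-smi not found; install the NVIDIA driver first" >&2
fi
```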

Basic NVIDIA-SMI Commands

1. Check GPU status

nvidia-smi

Shows a summary of GPU usage, driver version, memory, and running processes.

2. Detailed information

nvidia-smi -a

Displays detailed information about each GPU, including temperature, power usage, clocks, and performance state.

3. Real-time monitoring

watch -n 1 -d nvidia-smi

Refreshes the GPU stats every second for continuous monitoring. The -d flag highlights values that changed since the previous refresh.
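
nvidia-smi can also repeat a query by itself with -l SECONDS, which is handy for writing samples to a log file. The sketch below shows such a logging command in a comment and a small summarizer for the resulting file; avg_util and the file name gpu.log are assumptions made for this example.

```shell
#!/bin/sh
# Log one CSV sample per second until interrupted with Ctrl+C:
#   nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
#       --format=csv,noheader,nounits -l 1 >> gpu.log

# avg_util: read such CSV rows on stdin and print the mean of column 2
# (utilization.gpu, in percent). avg_util is a hypothetical helper name.
avg_util() {
    awk -F', *' 'NF >= 2 { sum += $2; n++ }
        END { if (n) printf "%.1f\n", sum / n; else print "no samples" }'
}
```

Usage, once some samples have been collected: avg_util < gpu.log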

4. List all NVIDIA devices

nvidia-smi -L

Example output:

GPU 0: NVIDIA GeForce RTX 3060 Ti (UUID: GPU-fa3da260-9c42-828f-981a-f6d7b48d77b3)

5. Query specific GPU details

nvidia-smi --query-gpu=index,name,uuid,serial --format=csv

Prints the selected fields as CSV, one row per GPU.
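
Because the output is CSV, it is easy to post-process in a script. The following is a minimal sketch that turns "index, name" rows into readable lines; gpu_names is a hypothetical helper name. The full list of queryable fields is printed by nvidia-smi --help-query-gpu.

```shell
#!/bin/sh
# gpu_names: read "index, name" CSV rows on stdin and print "GPU <index>: <name>".
# nvidia-smi separates CSV fields with a comma followed by a space.
gpu_names() {
    awk -F', *' '{ printf "GPU %s: %s\n", $1, $2 }'
}

# Only query the hardware when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_names
fi
```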

6. Monitor GPU utilization per second

nvidia-smi dmon

Prints one line of statistics per GPU every second. Columns include the GPU index, power draw, temperature, SM (compute) utilization, memory utilization, encoder/decoder utilization, and clock speeds.
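
dmon accepts -s to select metric groups, -d for the delay in seconds, and -c for the number of samples. As a rough sketch, the snippet below collects five utilization samples and reports the highest SM value; peak_sm is a hypothetical helper, and it assumes the "-s u" layout in which column 2 is the SM percentage.

```shell
#!/bin/sh
# peak_sm: read dmon output on stdin and print the highest SM utilization seen.
# Header lines start with "#" and are skipped. On multi-GPU systems the peak
# is taken across all GPUs.
peak_sm() {
    awk '!/^#/ && NF >= 2 { if ($2 + 0 > max) max = $2 + 0 } END { print max + 0 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Utilization group only, one sample per second, five samples.
    nvidia-smi dmon -s u -d 1 -c 5 | peak_sm
fi
```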

7. Monitor per-process GPU usage

nvidia-smi pmon

Shows GPU usage per running process, including PID, process type (C for compute, G for graphics), SM%, memory%, encoder/decoder%, and command.
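
For scripting, a CSV list of processes is often easier to work with than the pmon table. The sketch below uses --query-compute-apps, which lists compute processes only, and sorts them by memory use; top_gpu_procs is a hypothetical helper name.

```shell
#!/bin/sh
# top_gpu_procs [N]: read "pid, process_name, used_memory" CSV rows on stdin
# and print the N rows (default 5) with the largest memory use, biggest first.
top_gpu_procs() {
    sort -t, -k3,3nr | head -n "${1:-5}"
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-compute-apps=pid,process_name,used_memory \
        --format=csv,noheader,nounits | top_gpu_procs 3
fi
```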

Understanding NVIDIA-SMI Metrics

Metric                Description
Temp                  Core GPU temperature in °C. Normal operation is below 90°C.
Perf                  Performance state, from P0 (maximum performance) to P12 (minimum).
Persistence-M         Persistence mode. “On” keeps the NVIDIA driver loaded even when no application is using the GPU, for faster application startup.
Pwr:Usage/Cap         Current power draw and the configured power limit, in watts.
Bus-Id                PCI bus ID of the GPU.
Disp.A                Display active flag. “Off” if no display is attached.
Memory-Usage          GPU memory in use out of the total, in MiB. Important for AI/ML workloads.
Volatile Uncorr. ECC  Count of uncorrectable ECC memory errors since the last driver load. Shows N/A on GPUs without ECC.
GPU-Util              Percentage of the sampling period during which at least one kernel was running on the GPU.
Compute M.            Compute mode: Default (shared), Exclusive Process, or Prohibited.
GPU                   GPU index, used to tell devices apart on multi-GPU systems.
PID / Process name    ID and name of each process using the GPU.
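
Most of these values can also be read individually with --query-gpu. The following is a minimal sketch that prints a few of them as CSV and derives memory use as a percentage; mem_percent is a hypothetical helper name, and the field names used are those listed by nvidia-smi --help-query-gpu.

```shell
#!/bin/sh
# mem_percent: read "memory.used, memory.total" rows (MiB, without units) on
# stdin and print memory use as a percentage, one line per GPU.
mem_percent() {
    awk -F', *' '$2 + 0 > 0 { printf "%.1f%%\n", 100 * $1 / $2 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Raw values behind several columns of the summary table.
    nvidia-smi --query-gpu=index,temperature.gpu,pstate,power.draw,utilization.gpu \
        --format=csv
    nvidia-smi --query-gpu=memory.used,memory.total \
        --format=csv,noheader,nounits | mem_percent
fi
```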

Tips for Beginners

  • Run nvidia-smi frequently when training models to monitor memory and utilization.
  • Use watch -n 1 nvidia-smi for real-time observation.
  • If the GPU temperature exceeds 90°C, check the cooling or reduce the load.
  • Check per-process usage with nvidia-smi pmon to find resource-heavy applications.
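
The temperature tip can be automated with a small check, for example from a cron job. This is only a sketch: hot_gpus is a hypothetical helper name, and the 90°C default simply mirrors the threshold mentioned above.

```shell
#!/bin/sh
# hot_gpus [LIMIT]: read "index, temperature" CSV rows on stdin and print a
# warning for every GPU whose temperature is above LIMIT (default 90).
hot_gpus() {
    awk -F', *' -v limit="${1:-90}" \
        '$2 + 0 > limit + 0 { printf "GPU %s is at %s C (limit %s C)\n", $1, $2, limit }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,temperature.gpu \
        --format=csv,noheader,nounits | hot_gpus 90
fi
```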

Summary

nvidia-smi is the standard command-line tool for monitoring NVIDIA GPUs on Ubuntu. It:

  • Ships with the NVIDIA driver
  • Provides real-time GPU usage, memory, and temperature
  • Supports per-process monitoring and multi-GPU setups

Beginners can use it to track GPU health, optimize AI workloads on a Linux GPU server, or debug performance issues.
