NVIDIA GPU Driver Persistence

A troubleshooting walkthrough and an introduction to NVIDIA GPU driver persistence

NVIDIA GPU Driver Persistence Configuration and Troubleshooting

Overview

This article documents a monitoring anomaly caused by running the NVIDIA GPU driver in non-persistent mode, then explains how driver persistence works and how to configure it.


Problem Symptoms

While stress testing an algorithm program, the Grafana monitoring dashboard showed that the Nvidia Exporter service was unstable, with metrics arriving only intermittently:

[Figure: Nvidia Exporter Service]

Initial Investigation

Ruling Out Prometheus Scrape Issues
Running curl http://localhost:9835/metrics by hand on the target GPU server also timed out, which confirmed that the problem lay in the Exporter service itself rather than in Prometheus scraping.
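A minimal sketch of that check, assuming curl is installed. The helper name probe_exporter and the 10-second default timeout are illustrative assumptions, not values from the original investigation:

```shell
# probe_exporter is a hypothetical helper, not part of Nvidia Exporter.
# It requests the metrics endpoint with a hard timeout and prints the
# HTTP status code and total request time as reported by curl.
# Its exit status is curl's exit status (28 means the request timed out).
probe_exporter() {
    local url="$1"
    local timeout_s="${2:-10}"   # assumed default; align it with your scrape timeout
    curl --silent --show-error --output /dev/null \
         --max-time "$timeout_s" \
         --write-out 'status=%{http_code} total=%{time_total}s\n' \
         "$url"
}

# Usage (run on the GPU server itself, bypassing Prometheus):
#   probe_exporter http://localhost:9835/metrics 10 || echo "exporter did not answer in time"
```

Querying the endpoint locally takes the network and the Prometheus scrape configuration out of the picture, so a timeout here points at the exporter or at whatever it calls.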

Adjusting Log Level
The Nvidia Exporter log level was adjusted to debug, but no obvious error messages were found.

Root Cause Identification
The next step was to run, by hand, the query command that Nvidia Exporter uses internally:

nvidia-smi --query-gpu=timestamp,driver_version,vgpu_driver_capability.heterogenous_multivGPU,count,name,serial,uuid,pci.bus_id,pci.domain,pci.bus,pci.device,pci.baseClass,pci.subClass,pci.device_id,pci.sub_device_id,vgpu_device_capability.fractional_multiVgpu,vgpu_device_capability.heterogeneous_timeSlice_profile,vgpu_device_capability.heterogeneous_timeSlice_sizes,vgpu_device_capability.homogeneous_placements,pcie.link.gen.current,pcie.link.gen.gpucurrent,pcie.link.gen.max,pcie.link.gen.gpumax,pcie.link.gen.hostmax,pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,persistence_mode,addressing_mode,accounting.mode,accounting.buffer_size,driver_model.current,driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,gpu_recovery_action,gom.current,gom.pending,fan.speed,pstate,clocks_event_reasons.supported,clocks_event_reasons.active,clocks_event_reasons.gpu_idle,clocks_event_reasons.applications_clocks_setting,clocks_event_reasons.sw_power_cap,clocks_event_reasons.hw_slowdown,clocks_event_reasons.hw_thermal_slowdown,clocks_event_reasons.hw_power_brake_slowdown,clocks_event_reasons.sw_thermal_slowdown,clocks_event_reasons.sync_boost,memory.total,memory.reserved,memory.used,memory.free,compute_mode,compute_cap,utilization.gpu,utilization.memory,utilization.encoder,utilization.decoder,utilization.jpeg,utilization.ofa,encoder.stats.sessionCount,encoder.stats.averageFps,encoder.stats.averageLatency,dramEncryption.mode.current,dramEncryption.mode.pending,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,ecc.errors.uncorrected.volatile.sram.parity,ecc.errors.uncorrected.volatile.sram.secded,ecc.errors.uncorrected.aggregate.sram.parity,ecc.errors.uncorrected.aggregate.sram.secded,ecc.errors.uncorrected.aggregate.sram.thresholdExceeded,ecc.errors.uncorrected.aggregate.sram.l2,ecc.errors.uncorrected.aggregate.sram.sm,ecc.errors.uncorrected.aggregate.sram.mcu,ecc.errors.uncorrected.aggregate.sram.pcie,ecc.errors.uncorrected.aggregate.sram.other,retired_pages.single_bit_ecc.count,retired_pages.double_bit.count,retired_pages.pending,remapped_rows.correctable,remapped_rows.uncorrectable,remapped_rows.pending,remapped_rows.failure,remapped_rows.histogram.max,remapped_rows.histogram.high,remapped_rows.histogram.partial,remapped_rows.histogram.low,remapped_rows.histogram.none,temperature.gpu,temperature.gpu.tlimit,temperature.memory,power.management,power.draw,power.draw.average,power.draw.instant,power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,module.power.draw.average,module.power.draw.instant,module.power.limit,module.enforced.power.limit,module.power.default_limit,module.power.min_limit,module.power.max_limit,clocks.current.graphics,clocks.current.sm,clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,mig.mode.current,mig.mode.pending,gsp.mode.current,gsp.mode.default,c2c.mode,protected_memory.total,protected_memory.used,protected_memory.free,fabric.state,fabric.status,platform.chassis_serial_number,platform.slot_number,platform.tray_index,platform.host_id,platform.peer_type,platform.module_id,platform.gpu_fabric_guid --format=csv

Key Finding: The command took between 3 and 10 seconds to complete, varying from run to run, which is clearly abnormal. The test environment had 8 GPUs in total: 2 were occupied by the algorithm program and the remaining 6 were idle.
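One way to make this kind of fluctuation visible is to time the same query several times in a row. The sketch below assumes GNU date (for nanosecond timestamps). The helper names time_cmd_ms and repeat_timing are hypothetical, and the shortened field list in the usage line stands in for the full query above:

```shell
# time_cmd_ms is a hypothetical helper: it runs the given command once,
# discards its output and exit status, and prints the wall-clock duration
# in milliseconds. Relies on GNU date for the %N (nanoseconds) format.
time_cmd_ms() {
    local start_ns end_ns
    start_ns=$(date +%s%N)
    "$@" >/dev/null 2>&1 || true
    end_ns=$(date +%s%N)
    echo $(( (end_ns - start_ns) / 1000000 ))
}

# repeat_timing N CMD...: print one duration per run so that
# run-to-run variation is easy to spot.
repeat_timing() {
    local runs="$1" i=1
    shift
    while [ "$i" -le "$runs" ]; do
        echo "run $i: $(time_cmd_ms "$@") ms"
        i=$(( i + 1 ))
    done
}

# Usage with a shortened query (the full field list works the same way):
#   repeat_timing 5 nvidia-smi --query-gpu=index,utilization.gpu --format=csv
```

Running the same measurement before and after a configuration change gives a direct comparison on the same host.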

After reviewing the NVIDIA official documentation on GPU driver persistence, we attempted to enable persistent mode.


Solution

Temporarily Enable Persistent Mode

Run the following command as root to enable GPU driver persistence immediately:

nvidia-smi -pm 1

When the query command was run again, its response time dropped to milliseconds, which resolved the problem.
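To confirm that the setting took effect on every GPU, the persistence_mode field (one of the fields in the exporter query above) can be queried directly. This is a minimal sketch. The helper name list_non_persistent is hypothetical, and it assumes nvidia-smi reports the mode as Enabled or Disabled in its CSV output:

```shell
# list_non_persistent is a hypothetical helper. It reads CSV rows of the form
# "<index>, <persistence_mode>" on stdin and prints the index of every GPU
# whose persistence mode is anything other than "Enabled".
list_non_persistent() {
    awk -F', *' '$2 != "Enabled" { print $1 }'
}

# Usage on a GPU host:
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader | list_non_persistent
# Empty output means persistence mode is enabled on every GPU.
```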

Configure Automatic Startup

The setting made with nvidia-smi -pm 1 does not survive a reboot. To enable persistence automatically at boot, configure a systemd service:

1. Create Service Configuration File

sudo vim /usr/lib/systemd/system/nvidia-persistenced.service

2. Add the Following Content

[Unit]
Description=NVIDIA Persistence Daemon
Wants=network.target

[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
ExecStart=/usr/bin/nvidia-persistenced --persistence-mode
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target

3. Enable and Start the Service

sudo systemctl enable nvidia-persistenced.service --now
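After enabling the unit, it is worth confirming that it is both running and registered for the next boot. The sketch below uses the standard systemctl is-active and is-enabled subcommands. The helper name persistenced_ok is hypothetical:

```shell
# persistenced_ok is a hypothetical helper: it succeeds only if the unit is
# running right now AND enabled to start at the next boot.
persistenced_ok() {
    local unit="${1:-nvidia-persistenced.service}"
    systemctl is-active --quiet "$unit" && systemctl is-enabled --quiet "$unit"
}

# Usage:
#   sudo systemctl daemon-reload   # ask systemd to re-read newly created unit files
#   persistenced_ok && echo "persistence daemon is active and enabled"
```

Because the function only returns an exit status, it can also be called from a cron job or a monitoring check.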

Verification

After configuration, the monitoring system returned to normal, with stable GPU usage collection:

[Figure: Nvidia Exporter Service After Fix]


Technical Principles

GPU Driver Loading Mechanism

NVIDIA GPU interaction depends on the kernel mode driver, which operates in two modes:

  • Persistent Mode: The driver remains continuously active
  • On-Demand Loading Mode: The driver loads only when a program uses the GPU

Driver Lifecycle

Initialization Phase
When the first program attempts to interact with the GPU, if the kernel driver is not running, the system triggers driver loading and GPU device initialization.

De-initialization Phase
After all GPU client programs exit, the driver executes GPU de-initialization operations, essentially “shutting down” the GPU device.

Impact on Users

Application Startup Delay

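Whenever an application triggers GPU initialization, operations such as ECC memory scrubbing add a startup delay on the order of 1 to 3 seconds per GPU. If the GPU is already initialized, there is no such delay. With 6 idle GPUs being re-initialized for each query on the 8-GPU test machine, this is consistent with the 3 to 10 second command times observed above.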

Driver State Loss

After GPU de-initialization, non-persistent state information (such as power limits, clock frequency configurations, etc.) is lost and restored to default values upon the next initialization. Enabling persistent mode avoids this issue.
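As an illustration, suppose a custom power limit had been applied earlier (for example with nvidia-smi -pl). A limit that has silently reverted can be spotted by comparing power.limit with power.default_limit, both of which appear in the exporter query above. This is a sketch under that assumption. The helper name list_at_default_limit is hypothetical:

```shell
# list_at_default_limit is a hypothetical helper. It reads CSV rows of the form
# "<index>, <power.limit>, <power.default_limit>" on stdin and prints the index
# of each GPU whose current power limit equals the default limit.
# Rows whose limit is not numeric (e.g. unsupported on that GPU) are skipped.
list_at_default_limit() {
    awk -F', *' '$2 ~ /^[0-9.]+$/ && $2 == $3 { print $1 }'
}

# Usage on a GPU host where a custom limit is expected:
#   nvidia-smi --query-gpu=index,power.limit,power.default_limit \
#              --format=csv,noheader,nounits | list_at_default_limit
# Any index printed is a GPU currently running at its default power limit.
```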


Platform Differences

Windows Platform

On Windows systems, the kernel driver loads at system startup and remains running until system shutdown. Therefore, Windows users typically do not need to be concerned about driver persistence issues.

Note: Driver reload events (such as TDR triggers or driver updates) will cause non-persistent state resets.

Linux Platform

Linux system behavior depends on the runtime environment:

Graphical Environment
If the X Server runs on the target GPU, the kernel driver typically remains active from boot to shutdown, maintained by the X process connection.

Headless Server Environment
On servers without a graphical interface (headless servers), if there is no long-running GPU client, every application start and stop triggers driver loading and unloading. This pattern is extremely common in High-Performance Computing (HPC) and data center environments, and it was the root cause of this incident.


Best Practice Recommendations

  1. Enable GPU driver persistence in production environments, especially on headless servers.
  2. Use a systemd service so that persistence is re-enabled automatically after every reboot.
  3. After enabling persistence, test the monitoring system thoroughly to verify that metric collection is stable.
  4. Check the nvidia-persistenced service status regularly to confirm that it is still running.
