<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Driver Persistence on Kang&#39;s Blog</title>
        <link>https://blog.coderkang.top/en/tags/driver-persistence/</link>
        <description>Recent content in Driver Persistence on Kang&#39;s Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Fri, 07 Nov 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.coderkang.top/en/tags/driver-persistence/index.xml" rel="self" type="application/rss+xml" /><item>
            <title>NVIDIA GPU Driver Persistence</title>
            <link>https://blog.coderkang.top/en/p/nvidia_gpu_driver_persistence/</link>
            <pubDate>Fri, 07 Nov 2025 00:00:00 +0000</pubDate>
            <guid>https://blog.coderkang.top/en/p/nvidia_gpu_driver_persistence/</guid>
            <description>&lt;h1 id=&#34;nvidia-gpu-driver-persistence-configuration-and-troubleshooting&#34;&gt;NVIDIA GPU Driver Persistence Configuration and Troubleshooting&#xD;&#xA;&lt;/h1&gt;&lt;h2 id=&#34;overview&#34;&gt;Overview&#xD;&#xA;&lt;/h2&gt;&lt;p&gt;This article documents a monitoring anomaly caused by the NVIDIA GPU driver running in non-persistent mode, then explains how driver persistence works and how to configure it.&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;problem-symptoms&#34;&gt;Problem Symptoms&#xD;&#xA;&lt;/h2&gt;&lt;p&gt;While an algorithm program was being stress tested, the Grafana monitoring dashboard showed that the Nvidia Exporter service was unstable and only intermittently available:&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Nvidia Exporter Service&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;448px&#34; data-flex-grow=&#34;186&#34; height=&#34;872&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://blog.coderkang.top/p/nvidia_gpu_driver_persistence/nvidia_exporter_service.png&#34; srcset=&#34;https://blog.coderkang.top/p/nvidia_gpu_driver_persistence/nvidia_exporter_service_hu_e4d82039efb1d802.png 800w, https://blog.coderkang.top/p/nvidia_gpu_driver_persistence/nvidia_exporter_service_hu_c444077d4475ddc7.png 1600w, https://blog.coderkang.top/p/nvidia_gpu_driver_persistence/nvidia_exporter_service.png 1628w&#34; width=&#34;1628&#34;&gt;&lt;/p&gt;&#xA;&lt;h3 id=&#34;initial-investigation&#34;&gt;Initial Investigation&#xD;&#xA;&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Ruling Out Prometheus Scrape Issues&lt;/strong&gt;&lt;br&gt;&#xA;Running &lt;code&gt;curl http://localhost:9835/metrics&lt;/code&gt; manually on the target GPU server timed out, confirming that the issue was with 
the Exporter service itself.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Adjusting Log Level&lt;/strong&gt;&lt;br&gt;&#xA;The Nvidia Exporter log level was adjusted to &lt;code&gt;debug&lt;/code&gt;, but no obvious error messages were found.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Root Cause Identification&lt;/strong&gt;&lt;br&gt;&#xA;Manually executing the query command used internally by Nvidia Exporter:&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi --query-gpu&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;timestamp,driver_version,vgpu_driver_capability.heterogenous_multivGPU,count,name,serial,uuid,pci.bus_id,pci.domain,pci.bus,pci.device,pci.baseClass,pci.subClass,pci.device_id,pci.sub_device_id,vgpu_device_capability.fractional_multiVgpu,vgpu_device_capability.heterogeneous_timeSlice_profile,vgpu_device_capability.heterogeneous_timeSlice_sizes,vgpu_device_capability.homogeneous_placements,pcie.link.gen.current,pcie.link.gen.gpucurrent,pcie.link.gen.max,pcie.link.gen.gpumax,pcie.link.gen.hostmax,pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,persistence_mode,addressing_mode,accounting.mode,accounting.buffer_size,driver_model.current,driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,gpu_recovery_action,gom.current,gom.pending,fan.speed,pstate,clocks_event_reasons.supported,clocks_event_reasons.active,clocks_event_reasons.gpu_idle,clocks_event_reasons.applications_clocks_setting,clocks_event_reasons.sw_p
ower_cap,clocks_event_reasons.hw_slowdown,clocks_event_reasons.hw_thermal_slowdown,clocks_event_reasons.hw_power_brake_slowdown,clocks_event_reasons.sw_thermal_slowdown,clocks_event_reasons.sync_boost,memory.total,memory.reserved,memory.used,memory.free,compute_mode,compute_cap,utilization.gpu,utilization.memory,utilization.encoder,utilization.decoder,utilization.jpeg,utilization.ofa,encoder.stats.sessionCount,encoder.stats.averageFps,encoder.stats.averageLatency,dramEncryption.mode.current,dramEncryption.mode.pending,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.
aggregate.total,ecc.errors.uncorrected.volatile.sram.parity,ecc.errors.uncorrected.volatile.sram.secded,ecc.errors.uncorrected.aggregate.sram.parity,ecc.errors.uncorrected.aggregate.sram.secded,ecc.errors.uncorrected.aggregate.sram.thresholdExceeded,ecc.errors.uncorrected.aggregate.sram.l2,ecc.errors.uncorrected.aggregate.sram.sm,ecc.errors.uncorrected.aggregate.sram.mcu,ecc.errors.uncorrected.aggregate.sram.pcie,ecc.errors.uncorrected.aggregate.sram.other,retired_pages.single_bit_ecc.count,retired_pages.double_bit.count,retired_pages.pending,remapped_rows.correctable,remapped_rows.uncorrectable,remapped_rows.pending,remapped_rows.failure,remapped_rows.histogram.max,remapped_rows.histogram.high,remapped_rows.histogram.partial,remapped_rows.histogram.low,remapped_rows.histogram.none,temperature.gpu,temperature.gpu.tlimit,temperature.memory,power.management,power.draw,power.draw.average,power.draw.instant,power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,module.power.draw.average,module.power.draw.instant,module.power.limit,module.enforced.power.limit,module.power.default_limit,module.power.min_limit,module.power.max_limit,clocks.current.graphics,clocks.current.sm,clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,mig.mode.current,mig.mode.pending,gsp.mode.current,gsp.mode.default,c2c.mode,protected_memory.total,protected_memory.used,protected_memory.free,fabric.state,fabric.status,platform.chassis_serial_number,platform.slot_number,platform.tray_index,platform.host_id,platform.peer_type,platform.module_id,platform.gpu_fabric_guid --format&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;csv&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Key Finding&lt;/strong&gt;: The 
command execution time ranged from 3 to 10 seconds, which was clearly abnormal. The test environment had 8 GPUs in total: 2 were in use by the algorithm program and the remaining 6 were idle.&lt;/p&gt;&#xA;&lt;p&gt;After reviewing the NVIDIA &lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/deploy/driver-persistence/index.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;&#xD;&#xA;    &gt;official documentation on GPU driver persistence&lt;/a&gt;, we tried enabling persistent mode.&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;solution&#34;&gt;Solution&#xD;&#xA;&lt;/h2&gt;&lt;h3 id=&#34;temporarily-enable-persistent-mode&#34;&gt;Temporarily Enable Persistent Mode&#xD;&#xA;&lt;/h3&gt;&lt;p&gt;Run the following command as root to enable GPU driver persistence immediately (the setting does not survive a reboot):&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;nvidia-smi -pm &lt;span class=&#34;m&#34;&gt;1&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;When the query command was run again, the response time dropped to milliseconds: &lt;strong&gt;problem solved&lt;/strong&gt;.&lt;/p&gt;&#xA;&lt;h3 id=&#34;configure-automatic-startup&#34;&gt;Configure Automatic Startup&#xD;&#xA;&lt;/h3&gt;&lt;p&gt;To keep persistence enabled after a reboot, configure a systemd service:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;1. 
Create Service Configuration File&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo vim /usr/lib/systemd/system/nvidia-persistenced.service&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;2. Add the Following Content&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11&#xA;&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-ini&#34; data-lang=&#34;ini&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span 
class=&#34;k&#34;&gt;[Unit]&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;Description&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;NVIDIA Persistence Daemon&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;Wants&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;network.target&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;[Service]&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;Type&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;forking&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;PIDFile&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;/var/run/nvidia-persistenced/nvidia-persistenced.pid&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;ExecStart&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;/usr/bin/nvidia-persistenced --persistence-mode&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;ExecStopPost&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;/bin/rm -rf /var/run/nvidia-persistenced&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span 
class=&#34;cl&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;[Install]&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;na&#34;&gt;WantedBy&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;multi-user.target&lt;/span&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;3. Enable and Start the Service&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;&#xA;&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1&#xA;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&#xA;&lt;td class=&#34;lntd&#34;&gt;&#xA;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;sudo systemctl &lt;span class=&#34;nb&#34;&gt;enable&lt;/span&gt; nvidia-persistenced.service --now&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&#xA;&lt;/div&gt;&#xA;&lt;/div&gt;&lt;h3 id=&#34;verification&#34;&gt;Verification&#xD;&#xA;&lt;/h3&gt;&lt;p&gt;After configuration, the monitoring system returned to normal, with stable GPU usage collection:&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Nvidia Exporter Service After Fix&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;440px&#34; data-flex-grow=&#34;183&#34; height=&#34;866&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; 
src=&#34;https://blog.coderkang.top/p/nvidia_gpu_driver_persistence/nvidia_exporter_service_after.png&#34; srcset=&#34;https://blog.coderkang.top/p/nvidia_gpu_driver_persistence/nvidia_exporter_service_after_hu_a3dbef001b004b7f.png 800w, https://blog.coderkang.top/p/nvidia_gpu_driver_persistence/nvidia_exporter_service_after.png 1589w&#34; width=&#34;1589&#34;&gt;&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;technical-principles&#34;&gt;Technical Principles&#xD;&#xA;&lt;/h2&gt;&lt;h3 id=&#34;gpu-driver-loading-mechanism&#34;&gt;GPU Driver Loading Mechanism&#xD;&#xA;&lt;/h3&gt;&lt;p&gt;Any interaction with an NVIDIA GPU goes through the kernel-mode driver, which operates in one of two modes:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Persistent Mode&lt;/strong&gt;: The driver remains continuously active&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;On-Demand Loading Mode&lt;/strong&gt;: The driver loads only when a program uses the GPU&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h3 id=&#34;driver-lifecycle&#34;&gt;Driver Lifecycle&#xD;&#xA;&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Initialization Phase&lt;/strong&gt;&lt;br&gt;&#xA;When a program attempts to interact with the GPU while the kernel driver is not running, the system loads the driver and initializes the GPU device.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;De-initialization Phase&lt;/strong&gt;&lt;br&gt;&#xA;After all GPU client programs exit, the driver de-initializes the GPU, essentially &amp;ldquo;shutting down&amp;rdquo; the device.&lt;/p&gt;&#xA;&lt;h3 id=&#34;impact-on-users&#34;&gt;Impact on Users&#xD;&#xA;&lt;/h3&gt;&lt;h4 id=&#34;application-startup-delay&#34;&gt;Application Startup Delay&#xD;&#xA;&lt;/h4&gt;&lt;p&gt;Whenever GPU initialization is triggered, operations such as ECC memory scrubbing cause a delay of about &lt;strong&gt;1-3 seconds&lt;/strong&gt; per GPU. 
If the GPU is already initialized, there is no such delay.&lt;/p&gt;&#xA;&lt;h4 id=&#34;driver-state-loss&#34;&gt;Driver State Loss&#xD;&#xA;&lt;/h4&gt;&lt;p&gt;After GPU de-initialization, non-persistent state (such as power limits and clock frequency settings) is lost and reverts to default values on the next initialization. Enabling persistent mode avoids this issue.&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;platform-differences&#34;&gt;Platform Differences&#xD;&#xA;&lt;/h2&gt;&lt;h3 id=&#34;windows-platform&#34;&gt;Windows Platform&#xD;&#xA;&lt;/h3&gt;&lt;p&gt;On Windows systems, the kernel driver loads at system startup and remains running until system shutdown. Windows users therefore typically do not need to worry about driver persistence.&lt;/p&gt;&#xA;&#xD;&#xA;    &lt;blockquote&gt;&#xD;&#xA;        &lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Driver reload events (such as a TDR or a driver update) still reset non-persistent state.&lt;/p&gt;&#xA;&#xD;&#xA;    &lt;/blockquote&gt;&#xD;&#xA;&lt;h3 id=&#34;linux-platform&#34;&gt;Linux Platform&#xD;&#xA;&lt;/h3&gt;&lt;p&gt;On Linux, the behavior depends on the runtime environment:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Graphical Environment&lt;/strong&gt;&lt;br&gt;&#xA;If an X server is running on the target GPU, the kernel driver typically stays active from boot to shutdown, because the X process keeps a connection to it open.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Headless Server Environment&lt;/strong&gt;&lt;br&gt;&#xA;On a headless server (one without a graphical interface) with no long-running GPU client, every application start and stop triggers driver loading and unloading. 
This pattern is extremely common in &lt;strong&gt;High-Performance Computing (HPC)&lt;/strong&gt; and &lt;strong&gt;Data Center&lt;/strong&gt; environments, and it was the root cause of this incident.&lt;/p&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;best-practice-recommendations&#34;&gt;Best Practice Recommendations&#xD;&#xA;&lt;/h2&gt;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;Enable GPU driver persistence in production environments&lt;/strong&gt;, especially on headless servers&lt;/li&gt;&#xA;&lt;li&gt;Use a &lt;code&gt;systemd&lt;/code&gt; service so that persistence is re-enabled automatically after a reboot&lt;/li&gt;&#xA;&lt;li&gt;After enabling persistence, test the monitoring system thoroughly to verify that metric collection is stable&lt;/li&gt;&#xA;&lt;li&gt;Check the &lt;code&gt;nvidia-persistenced&lt;/code&gt; service status regularly to confirm it is still running&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;hr&gt;&#xA;&lt;h2 id=&#34;references&#34;&gt;References&#xD;&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://docs.nvidia.com/deploy/driver-persistence/index.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;&#xD;&#xA;    &gt;NVIDIA Driver Persistence Official Documentation&lt;/a&gt;&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;</description>
        </item></channel>
</rss>
