Ascend NPU Deployment Guide for Kubernetes
Overview
This document describes how to deploy a containerized Ascend NPU environment in a Kubernetes cluster. It targets the following environment:
- Container Runtime: Containerd
- NPU Device: Ascend 310P
- System Architecture: aarch64 (ARM64)
- Kubernetes Version: 1.28+
The deployment process includes three main steps:
- Prepare the environment (node labels, users, directories)
- Install Ascend Docker Runtime
- Deploy Ascend Device Plugin
Preparation
Create Node Labels
Add appropriate labels to Kubernetes nodes for subsequent Pod scheduling and resource management.
|
# Label master node
kubectl label nodes ecs-b0tf90001 masterselector=dls-master-node
# Label NPU compute nodes
kubectl label nodes ecs-exyqec0002 node-role.kubernetes.io/worker=worker
kubectl label nodes ecs-exyqec0002 workerselector=dls-worker-node
kubectl label nodes ecs-exyqec0002 host-arch=huawei-arm
kubectl label nodes ecs-exyqec0002 accelerator=huawei-Ascend310P
|
Note: Replace ecs-b0tf90001 and ecs-exyqec0002 with your actual node names.
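When several NPU nodes need the same labels, the worker commands above can be generated per node. The following is a minimal sketch, not part of the original procedure: the function name npu_worker_label_cmds is hypothetical, and --overwrite is added only so that re-running the commands does not fail on labels that already exist.

```shell
# Print (do not run) the label commands for one NPU worker node.
# The label set is copied from the commands in this section.
npu_worker_label_cmds() {
    node=$1
    for label in \
        node-role.kubernetes.io/worker=worker \
        workerselector=dls-worker-node \
        host-arch=huawei-arm \
        accelerator=huawei-Ascend310P
    do
        printf 'kubectl label nodes %s %s --overwrite\n' "$node" "$label"
    done
}

# Review the generated commands first, then pipe them to sh to apply them:
# npu_worker_label_cmds ecs-exyqec0002
# npu_worker_label_cmds ecs-exyqec0002 | sh
```

Afterwards, kubectl get nodes -l accelerator=huawei-Ascend310P should list the labeled nodes.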
Create System User
Create dedicated users and groups on Ascend compute nodes (e.g., ecs-exyqec0002).
|
# Create hwMindX user (UID 9000)
useradd -d /home/hwMindX -u 9000 -m -s /sbin/nologin hwMindX
# Create HwHiAiUser group
groupadd HwHiAiUser
# Add hwMindX user to HwHiAiUser group
usermod -a -G HwHiAiUser hwMindX
|
Important: UID 9000 and group HwHiAiUser are the default configurations for the Ascend software stack. Do not modify them arbitrarily.
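The commands above fail if the user or group already exists, which may be the case on nodes that have been set up before. A re-runnable variant could look like the sketch below. The helper names (ensure_group, ensure_mindx_user) and the RUN dry-run variable are assumptions made for illustration; the account names and the UID come from this section.

```shell
# Dry run by default: commands are echoed instead of executed.
# Run as root with RUN= (empty) to actually apply the changes.
RUN=${RUN-echo}

ensure_group() {
    # Create the group only if it is not already present.
    getent group "$1" >/dev/null 2>&1 || $RUN groupadd "$1"
}

ensure_mindx_user() {
    # Create hwMindX (UID 9000) only if the account does not exist yet.
    id -u hwMindX >/dev/null 2>&1 ||
        $RUN useradd -d /home/hwMindX -u 9000 -m -s /sbin/nologin hwMindX
    # Adding a user to a group it already belongs to is harmless.
    $RUN usermod -a -G HwHiAiUser hwMindX
}

# ensure_group HwHiAiUser && ensure_mindx_user
```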
Create Log Directory
Create the Device Plugin log directory on Ascend compute nodes.
|
# Create the log directory (including any missing parent directories) and set permissions
mkdir -p -m 750 /var/log/mindx-dl/devicePlugin
chown root:root /var/log/mindx-dl/devicePlugin
|
Install Ascend Docker Runtime
Ascend Docker Runtime is a core component for using Ascend NPU in containerized environments and must be installed on all Ascend compute nodes.
Download Installation Package
Visit the official Git repository to download the corresponding version of the installation package.
Example: Ascend-docker-runtime_7.2.RC1.SPC2_linux-aarch64.run
Installation Steps
Step 1: Enter Installation Package Directory
|
cd <path to run package>
|
Step 2: Add Executable Permission
|
chmod u+x Ascend-docker-runtime_{version}_linux-{arch}.run
|
Step 3: Verify Package Integrity
|
./Ascend-docker-runtime_{version}_linux-{arch}.run --check
|
Expected output:
|
[WARNING]: --check is meaningless for Ascend-docker-runtime and will be discarded in the future
Verifying archive integrity... All good.
|
Step 4: Execute Installation
Method 1: Install to Default Path (Recommended)
|
./Ascend-docker-runtime_{version}_linux-{arch}.run --install
|
Method 2: Install to Custom Path
|
./Ascend-docker-runtime_{version}_linux-{arch}.run --install --install-path=<path>
|
Successful installation output example:
|
Uncompressing ascend-docker-runtime 100%
[INFO]: installing ascend docker runtime
...
[INFO] Ascend Docker Runtime install success
|
Default Installation Path: /usr/local/Ascend/Ascend-Docker-Runtime/
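The {version} and {arch} placeholders in the commands above can be filled in by a small helper. This is a sketch under the assumption that the package follows the naming of the example file (Ascend-docker-runtime_7.2.RC1.SPC2_linux-aarch64.run); the function names are hypothetical.

```shell
# Build the package file name from a version string and a CPU architecture.
# The architecture defaults to the output of `uname -m` (aarch64 on ARM64 hosts).
runtime_pkg_name() {
    version=$1
    arch=${2:-$(uname -m)}
    printf 'Ascend-docker-runtime_%s_linux-%s.run\n' "$version" "$arch"
}

# Print the default-path install sequence for review; pipe it to sh to run it.
runtime_install_cmds() {
    pkg=$(runtime_pkg_name "$1" "$2")
    printf 'chmod u+x %s\n' "$pkg"
    printf './%s --check\n' "$pkg"
    printf './%s --install\n' "$pkg"
}

# runtime_install_cmds 7.2.RC1.SPC2 aarch64
```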
After installation, edit the Containerd configuration file /etc/containerd/config.toml so that it points to Ascend Docker Runtime. The required settings depend on the cgroup version used by your system.
Configuration Method 1: Cgroup v1
Two key configuration items need to be modified:
runtime_type = "io.containerd.runtime.v1.linux"
runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"
Complete configuration example:
|
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runtime.v1.linux"
runtime_engine = ""
runtime_root = ""
privileged_without_host_devices = false
base_runtime_spec = ""
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
max_conf_num = 1
conf_template = ""
[plugins."io.containerd.grpc.v1.cri".registry]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://registry-1.docker.io"]
[plugins."io.containerd.grpc.v1.cri".image_decryption]
key_model = ""
# ... other configurations ...
[plugins."io.containerd.monitor.v1.cgroups"]
no_prometheus = false
[plugins."io.containerd.runtime.v1.linux"]
shim = "containerd-shim"
runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"
runtime_root = ""
no_shim = false
shim_debug = false
[plugins."io.containerd.runtime.v2.task"]
platforms = ["linux/amd64"]
|
Configuration Method 2: Cgroup v2
The following key configuration item needs to be modified:
BinaryName = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"
Complete configuration example:
|
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
base_runtime_spec = ""
cni_conf_dir = ""
cni_max_conf_num = 0
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_path = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
BinaryName = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = true
|
Tip: To check the cgroup version, execute stat -fc %T /sys/fs/cgroup/. Output cgroup2fs indicates v2, while tmpfs indicates v1.
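The check in the tip can be turned into a small decision helper. The sketch below maps the reported filesystem type to the cgroup version and to the configuration key named in the corresponding method above; the function names are hypothetical.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version.
cgroup_version_for() {
    case "$1" in
        cgroup2fs) echo v2 ;;
        tmpfs)     echo v1 ;;
        *)         echo unknown ;;
    esac
}

# Name the config.toml key that carries the Ascend Docker Runtime path.
runtime_key_for() {
    case "$1" in
        v1) echo runtime ;;      # Configuration Method 1
        v2) echo BinaryName ;;   # Configuration Method 2
        *)  return 1 ;;
    esac
}

# On a node:
# ver=$(cgroup_version_for "$(stat -fc %T /sys/fs/cgroup/)")
# echo "cgroup $ver: set $(runtime_key_for "$ver") in /etc/containerd/config.toml"
```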
Restart Services
|
systemctl daemon-reload
systemctl restart containerd kubelet
|
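A missed edit in config.toml is easy to overlook, so it can help to confirm that the file references the Ascend runtime around the time of the restart. The following sketch only inspects the file; the function name is hypothetical, and the pattern assumes the path ends in ascend-docker-runtime as in the examples above.

```shell
# Succeed if the given config file sets `runtime` or `BinaryName`
# to a path ending in ascend-docker-runtime.
config_uses_ascend_runtime() {
    grep -Eq '^[[:space:]]*(runtime|BinaryName)[[:space:]]*=[[:space:]]*".*/ascend-docker-runtime"' "$1"
}

# config_uses_ascend_runtime /etc/containerd/config.toml \
#     && echo "Ascend runtime configured" \
#     || echo "Ascend runtime NOT found in config"
```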
Verify Installation
Execute the following command on the Kubernetes master node to confirm that the Ascend compute nodes are healthy:
|
kubectl get nodes
|
Expected output example:
|
NAME STATUS ROLES AGE VERSION
k8s-master Ready master,worker 3d v1.28.12
k8s-worker Ready worker 3d v1.28.12
|
All nodes should show Ready status.
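On larger clusters the Ready check can be automated by filtering the node listing. This is a minimal sketch that assumes the default column layout shown above (NAME in the first column, STATUS in the second); the function name is hypothetical.

```shell
# Read `kubectl get nodes` output on stdin and print every node
# whose STATUS column is not exactly "Ready".
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# kubectl get nodes | not_ready_nodes
```

An empty result means every node reports Ready. Note that a cordoned node (STATUS "Ready,SchedulingDisabled") is also printed by this filter.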
Deploy Ascend Device Plugin
Ascend Device Plugin is responsible for managing and allocating NPU resources in Kubernetes.
Prepare Images
1. Pull Images
Execute the following commands on Ascend compute nodes to pull the required images:
|
# Pull all required images
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/resilience-controller:v7.1.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-operator:v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/npu-exporter:v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-controller-manager:v1.7.0-v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-scheduler:v1.7.0-v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/noded:v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/clusterd:v7.2.RC1
|
2. Export Images
|
docker save -o ascend.tar \
swr.cn-south-1.myhuaweicloud.com/ascendhub/resilience-controller:v7.1.RC1 \
swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-operator:v7.2.RC1 \
swr.cn-south-1.myhuaweicloud.com/ascendhub/npu-exporter:v7.2.RC1 \
swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v7.2.RC1 \
swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-controller-manager:v1.7.0-v7.2.RC1 \
swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-scheduler:v1.7.0-v7.2.RC1 \
swr.cn-south-1.myhuaweicloud.com/ascendhub/noded:v7.2.RC1 \
swr.cn-south-1.myhuaweicloud.com/ascendhub/clusterd:v7.2.RC1
|
3. Import to Containerd
|
ctr -n k8s.io images import ascend.tar
|
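After the import, it may be worth checking that every expected image is visible to Containerd. The sketch below compares two plain-text lists with one image reference per line; the function name and the file names expected.txt and imported.txt are hypothetical.

```shell
# Print every image reference listed in file $1 (expected)
# that does not appear in file $2 (imported).
missing_images() {
    while IFS= read -r ref; do
        [ -n "$ref" ] || continue            # skip blank lines
        grep -Fxq -- "$ref" "$2" || printf '%s\n' "$ref"
    done < "$1"
}

# ctr -n k8s.io images ls -q > imported.txt
# missing_images expected.txt imported.txt
```

An empty result means all expected references were found.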
Download Deployment Configuration Files
Visit the official Git repository to download the Ascend Device Plugin installation package.
Example: Ascend-mindxdl-device-plugin_7.2.RC1.SPC2_linux-aarch64.zip
After extracting, copy the corresponding YAML files to the Kubernetes management node.
Choose the Appropriate YAML File
Select the corresponding YAML file based on the actual device type and whether to use the Volcano scheduler:
| YAML Filename | Use Case |
| --- | --- |
| device-plugin-310-v{version}.yaml | Atlas 300I inference card, without Volcano |
| device-plugin-310-volcano-v{version}.yaml | Atlas 300I inference card, with Volcano |
| device-plugin-310P-1usoc-v{version}.yaml | Atlas 200I SoC A1 core board, without Volcano |
| device-plugin-310P-1usoc-volcano-v{version}.yaml | Atlas 200I SoC A1 core board, with Volcano |
| device-plugin-310P-v{version}.yaml | Atlas inference series products (e.g., 310P), without Volcano |
| device-plugin-310P-volcano-v{version}.yaml | Atlas inference series products, with Volcano |
| device-plugin-910-v{version}.yaml | Atlas training series products/A2/A3/800I A2, without Volcano |
| device-plugin-volcano-v{version}.yaml | Atlas training series products/A2/A3/800I A2, with Volcano |
Note:
- For Ascend 310P devices, typically choose device-plugin-310P-v{version}.yaml
- Do not modify the DaemonSet.metadata.name field in the YAML file to avoid automatic identification issues
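The selection table can also be expressed as a lookup. The following sketch reproduces the file-name pattern of the table only; the function name and the short device-type keys (310, 310P-1usoc, 310P, 910) are assumptions made for illustration.

```shell
# Args: device type (310, 310P-1usoc, 310P, 910), "yes" or "no" for Volcano, version string.
device_plugin_yaml() {
    device=$1 volcano=$2 version=$3
    case "$device" in
        310|310P|310P-1usoc)
            prefix="device-plugin-$device" ;;
        910)
            # In the table, the training-series Volcano file has no "910" in its name.
            if [ "$volcano" = yes ]; then
                prefix="device-plugin"
            else
                prefix="device-plugin-910"
            fi ;;
        *)
            echo "unknown device type: $device" >&2
            return 1 ;;
    esac
    if [ "$volcano" = yes ]; then
        echo "${prefix}-volcano-v${version}.yaml"
    else
        echo "${prefix}-v${version}.yaml"
    fi
}

# device_plugin_yaml 310P no 7.2.RC1.SPC2
```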
Deploy Device Plugin
|
# Before applying, set the image name in the YAML file to: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v7.2.RC1
kubectl apply -f device-plugin-310P-v7.2.RC1.SPC2.yaml
|
Expected output:
|
serviceaccount/ascend-device-plugin-sa created
clusterrole.rbac.authorization.k8s.io/pods-node-ascend-device-plugin-role created
clusterrolebinding.rbac.authorization.k8s.io/pods-node-ascend-device-plugin-rolebinding created
daemonset.apps/ascend-device-plugin-daemonset created
|
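The image name change mentioned in the deployment command can be scripted instead of edited by hand. The sketch below is an assumption about the YAML layout, not a description of the shipped file: it expects a single image: line whose value contains ascend-k8sdeviceplugin, and the function name is hypothetical. It writes the result to standard output and leaves the original file untouched.

```shell
# Print the YAML in $1 with the device plugin image line set to $2.
# Assumes the image line contains "ascend-k8sdeviceplugin"; indentation
# and a leading "- " are preserved.
set_plugin_image() {
    sed "s#^\([[:space:]-]*image:[[:space:]]*\).*ascend-k8sdeviceplugin.*#\1$2#" "$1"
}

# set_plugin_image device-plugin-310P-v7.2.RC1.SPC2.yaml \
#     swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v7.2.RC1 \
#     | kubectl apply -f -
```

Inspect the output (for example with grep image:) before piping it to kubectl apply.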
Verify Deployment
Check if the Device Plugin started successfully:
|
kubectl get pod -n kube-system | grep ascend
|
Expected output (status should be Running):
|
NAME READY STATUS RESTARTS AGE
ascend-device-plugin-daemonset-d5ctz 1/1 Running 0 11s
|
Check node NPU resources:
|
kubectl describe node <node-name> | grep -A 5 "Capacity:"
|
You should see the huawei.com/Ascend310P resource.
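To read the NPU count without scanning the output by eye, the resource line can be extracted with awk. This is a sketch that assumes the resource name huawei.com/Ascend310P used in this guide; the function name is hypothetical.

```shell
# Read `kubectl describe node` output on stdin and print the first
# huawei.com/Ascend310P value (the Capacity entry precedes Allocatable).
# Exits non-zero if the resource is not listed.
npu_capacity() {
    awk '$1 == "huawei.com/Ascend310P:" { print $2; found = 1; exit }
         END { exit !found }'
}

# kubectl describe node <node-name> | npu_capacity
```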
Using NPU Compute Cards
Pod Configuration Example
Request NPU resources through the resources field in the Pod definition:
|
apiVersion: v1
kind: Pod
metadata:
  name: npu-test-pod
spec:
  containers:
    - name: alg-container
      image: ubuntu:22.04
      resources:
        limits:
          memory: 24Gi
          huawei.com/Ascend310P: 1 # Request 1 NPU card
        requests:
          memory: 2Gi
          huawei.com/Ascend310P: 1 # Request 1 NPU card
      command: ["/bin/bash", "-c", "sleep infinity"]
|
Note:
- The value of huawei.com/Ascend310P indicates the number of NPU cards requested
- Values in limits and requests should be consistent
- Adjust the resource name according to the actual NPU model (e.g., 310, 910, etc.)
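Once the Pod is running, a quick way to confirm that a card was actually mounted is to look for NPU device nodes inside the container. The sketch below assumes the devices appear as /dev/davinci<N>, which is not stated in this guide; the function name is hypothetical.

```shell
# List NPU device nodes under a directory (default: /dev).
# Assumes the device nodes are named davinci<N>, e.g. davinci0.
npu_device_nodes() {
    dir=${1:-/dev}
    for path in "$dir"/davinci[0-9]*; do
        # An unmatched glob stays literal, so check that the path exists.
        [ -e "$path" ] && printf '%s\n' "$path"
    done
    return 0
}

# From outside the Pod, a similar check is:
# kubectl exec npu-test-pod -- sh -c 'ls /dev/davinci*'
```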