Ascend NPU Deployment Guide for Kubernetes

A comprehensive guide to deploying the Ascend NPU containerized environment in Kubernetes clusters

Overview

This document describes the complete process of deploying the Ascend NPU containerized environment in a Kubernetes cluster. It targets the following environment:

  • Container Runtime: Containerd
  • NPU Device: Ascend 310P
  • System Architecture: aarch64 (ARM64)
  • Kubernetes Version: 1.28+

The deployment process includes three main steps:

  1. Environment preparation (node labels, users, directories)
  2. Install Ascend Docker Runtime
  3. Deploy Ascend Device Plugin

Preparation

Create Node Labels

Add appropriate labels to Kubernetes nodes for subsequent Pod scheduling and resource management.

# Label master node
kubectl label nodes ecs-b0tf90001 masterselector=dls-master-node

# Label NPU compute nodes
kubectl label nodes ecs-exyqec0002 node-role.kubernetes.io/worker=worker
kubectl label nodes ecs-exyqec0002 workerselector=dls-worker-node
kubectl label nodes ecs-exyqec0002 host-arch=huawei-arm
kubectl label nodes ecs-exyqec0002 accelerator=huawei-Ascend310P

Note: Replace ecs-b0tf90001 and ecs-exyqec0002 with your actual node names.
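To confirm that the labels were applied, the label string reported by kubectl can be compared against the expected set. The following is a minimal sketch, not part of the original procedure; the helper name missing_labels is hypothetical, and it only inspects a comma-separated label string passed to it.

```shell
# Print each required label that is absent from a comma-separated label string.
# Usage: missing_labels "<labels>" <required-label>...
missing_labels() {
  local have=",$1,"
  shift
  local want
  for want in "$@"; do
    case "$have" in
      *",$want,"*) ;;                 # label present with the expected value
      *) printf '%s\n' "$want" ;;     # label missing or has a different value
    esac
  done
}

# Example (commented out because it needs cluster access):
# labels=$(kubectl get node ecs-exyqec0002 --show-labels --no-headers | awk '{print $NF}')
# missing_labels "$labels" workerselector=dls-worker-node host-arch=huawei-arm accelerator=huawei-Ascend310P
```

An empty result means every required label is present on the node.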

Create System User

Create dedicated users and groups on Ascend compute nodes (e.g., ecs-exyqec0002).

# Create hwMindX user (UID 9000)
useradd -d /home/hwMindX -u 9000 -m -s /sbin/nologin hwMindX

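# Create HwHiAiUser group if it does not already exist (the driver installation may have created it)
getent group HwHiAiUser >/dev/null || groupadd HwHiAiUser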

# Add hwMindX user to HwHiAiUser group
usermod -a -G HwHiAiUser hwMindX

Important: UID 9000 and group HwHiAiUser are the default configurations for the Ascend software stack. Do not modify them arbitrarily.
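A quick check of the result can catch a wrong UID or a missing group membership before the Device Plugin is deployed. This is a sketch under the assumptions stated in this section (UID 9000, group HwHiAiUser); the helper name mindx_user_ok is hypothetical.

```shell
# Succeeds when the UID is 9000 and the group list contains HwHiAiUser.
# Usage: mindx_user_ok <uid> "<space-separated group names>"
mindx_user_ok() {
  [ "$1" = "9000" ] || return 1
  case " $2 " in
    *" HwHiAiUser "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (run on the compute node after creating the user):
# mindx_user_ok "$(id -u hwMindX)" "$(id -nG hwMindX)" && echo "hwMindX looks correct"
```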

Create Log Directory

Create the Device Plugin log directory on Ascend compute nodes.

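# Create the parent log directory first (mkdir without -p fails if it is missing)
mkdir -p -m 755 /var/log/mindx-dl
chown root:root /var/log/mindx-dl
# Create the Device Plugin log directory and set permissions
mkdir -p -m 750 /var/log/mindx-dl/devicePlugin
chown root:root /var/log/mindx-dl/devicePlugin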
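The mode and ownership of the log directory can be read back with stat. The helper below is a small sketch that assumes GNU coreutils stat (the -c option); the name mode_and_owner is hypothetical.

```shell
# Print "<octal mode> <owner>:<group>" for a path (GNU coreutils stat).
mode_and_owner() {
  stat -c '%a %U:%G' "$1"
}

# Example (run on the compute node):
# mode_and_owner /var/log/mindx-dl/devicePlugin
```

With the commands above, the directory should report mode 750 and owner root:root.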

Install Ascend Docker Runtime

Ascend Docker Runtime is a core component for using Ascend NPU in containerized environments and must be installed on all Ascend compute nodes.

Download Installation Package

Visit the official Git repository and download the installation package that matches your software version and CPU architecture.

Example: Ascend-docker-runtime_7.2.RC1.SPC2_linux-aarch64.run
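The commands in this guide refer to the package as Ascend-docker-runtime_{version}_linux-{arch}.run. As an illustration, the filename can be assembled from a version string and the node architecture. This is a sketch only; the helper name runtime_pkg_name is hypothetical, and it assumes that the {arch} part equals the output of uname -m, which holds for the aarch64 example above.

```shell
# Build the installer filename from a version string and a machine architecture.
# The architecture defaults to the value reported by `uname -m` on this node.
runtime_pkg_name() {
  local version="$1"
  local arch="${2:-$(uname -m)}"
  printf 'Ascend-docker-runtime_%s_linux-%s.run\n' "$version" "$arch"
}

# Example:
# runtime_pkg_name 7.2.RC1.SPC2 aarch64
```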

Installation Steps

Step 1: Enter Installation Package Directory

cd <path to run package>

Step 2: Add Executable Permission

chmod u+x Ascend-docker-runtime_{version}_linux-{arch}.run

Step 3: Verify Package Integrity

The package must be executable before it can run its own integrity check.

./Ascend-docker-runtime_{version}_linux-{arch}.run --check

Expected output:

[WARNING]: --check is meaningless for Ascend-docker-runtime and will be discarded in the future
Verifying archive integrity... All good.

Step 4: Execute Installation

Method 1: Install to Default Path (Recommended)

./Ascend-docker-runtime_{version}_linux-{arch}.run --install

Method 2: Install to Custom Path

./Ascend-docker-runtime_{version}_linux-{arch}.run --install --install-path=<path>

Successful installation output example:

Uncompressing ascend-docker-runtime  100%
[INFO]: installing ascend docker runtime
...
[INFO] Ascend Docker Runtime install success

Default Installation Path: /usr/local/Ascend/Ascend-Docker-Runtime/

Configure Containerd

Modify the Containerd configuration file /etc/containerd/config.toml according to the cgroup version used by your system.

Configuration Method 1: Cgroup v1

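This method relies on the legacy io.containerd.runtime.v1.linux runtime, which was removed in containerd 2.0, so it applies to containerd 1.x only. Two key configuration items need to be modified: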

  • runtime_type = "io.containerd.runtime.v1.linux"
  • runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"

Complete configuration example:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runtime.v1.linux"
    runtime_engine = ""
    runtime_root = ""
    privileged_without_host_devices = false
    base_runtime_spec = ""
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"
  max_conf_num = 1
  conf_template = ""

[plugins."io.containerd.grpc.v1.cri".registry]
  [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
      endpoint = ["https://registry-1.docker.io"]

[plugins."io.containerd.grpc.v1.cri".image_decryption]
  key_model = ""

# ... other configurations ...

[plugins."io.containerd.monitor.v1.cgroups"]
  no_prometheus = false

[plugins."io.containerd.runtime.v1.linux"]
  shim = "containerd-shim"
  runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"
  runtime_root = ""
  no_shim = false
  shim_debug = false

[plugins."io.containerd.runtime.v2.task"]
  platforms = ["linux/arm64"]

Configuration Method 2: Cgroup v2

The following key configuration item needs to be modified:

  • BinaryName = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"

Complete configuration example:

[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    base_runtime_spec = ""
    cni_conf_dir = ""
    cni_max_conf_num = 0
    container_annotations = []
    pod_annotations = []
    privileged_without_host_devices = false
    runtime_engine = ""
    runtime_path = ""
    runtime_root = ""
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      BinaryName = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"
      CriuImagePath = ""
      CriuPath = ""
      CriuWorkPath = ""
      IoGid = 0
      IoUid = 0
      NoNewKeyring = false
      NoPivotRoot = false
      Root = ""
      ShimCgroup = ""
      SystemdCgroup = true

Tip: To check the cgroup version, execute stat -fc %T /sys/fs/cgroup/. Output cgroup2fs indicates v2, while tmpfs indicates v1.
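The tip above can be turned into a small helper that maps the reported filesystem type to the configuration method to follow. This is a minimal sketch; the function name cgroup_method is hypothetical.

```shell
# Map the filesystem type of /sys/fs/cgroup to the configuration method in this guide.
cgroup_method() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# Example (run on the compute node):
# cgroup_method "$(stat -fc %T /sys/fs/cgroup/)"
```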

Restart Services

systemctl daemon-reload
systemctl restart containerd kubelet
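If containerd does not pick up the Ascend runtime after the restart, a first thing to check is whether the configuration file actually references the runtime binary. The sketch below greps for either of the two keys described above (runtime for the cgroup v1 method, BinaryName for the cgroup v2 method); the function name ascend_runtime_configured is hypothetical.

```shell
# Succeeds if the given containerd config points a runtime at the Ascend binary.
# Matches `runtime = "..."` (cgroup v1 method) and `BinaryName = "..."` (cgroup v2 method),
# but not `runtime_type`, `runtime_root`, or empty values.
ascend_runtime_configured() {
  local config="${1:-/etc/containerd/config.toml}"
  grep -Eq '^[[:space:]]*(runtime|BinaryName)[[:space:]]*=[[:space:]]*".*/ascend-docker-runtime"' "$config"
}

# Example (run on the compute node):
# ascend_runtime_configured && echo "Ascend runtime is referenced in config.toml"
```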

Verify Installation

Execute the following command on the Kubernetes master node to confirm that the Ascend compute nodes are in the Ready state:

kubectl get nodes

Expected output example:

NAME              STATUS   ROLES           AGE   VERSION
k8s-master        Ready    master,worker   3d    v1.28.12
k8s-worker        Ready    worker          3d    v1.28.12

All nodes should show Ready status.
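On larger clusters it can be easier to list only the nodes that are not Ready. The following sketch filters the output of kubectl get nodes --no-headers; the function name not_ready_nodes is hypothetical, and a node with a combined status such as Ready,SchedulingDisabled is also reported because the match is exact.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the names of
# nodes whose STATUS column is anything other than exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Example (requires cluster access):
# kubectl get nodes --no-headers | not_ready_nodes
```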


Deploy Ascend Device Plugin

Ascend Device Plugin is responsible for managing and allocating NPU resources in Kubernetes.

Prepare Images

1. Pull Images

Execute the following commands on Ascend compute nodes to pull the required images:

# Pull all required images
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/resilience-controller:v7.1.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-operator:v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/npu-exporter:v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-controller-manager:v1.7.0-v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-scheduler:v1.7.0-v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/noded:v7.2.RC1
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/clusterd:v7.2.RC1

2. Export Images

docker save -o ascend.tar \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/resilience-controller:v7.1.RC1 \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-operator:v7.2.RC1 \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/npu-exporter:v7.2.RC1 \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v7.2.RC1 \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-controller-manager:v1.7.0-v7.2.RC1 \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/vc-scheduler:v1.7.0-v7.2.RC1 \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/noded:v7.2.RC1 \
  swr.cn-south-1.myhuaweicloud.com/ascendhub/clusterd:v7.2.RC1

3. Import to Containerd

ctr -n k8s.io images import ascend.tar
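Because the same eight image references appear in both the pull and the export step, keeping them in one place reduces the chance of a mismatch. The sketch below prints the list from the commands above; the variable REGISTRY and the function ascend_images are hypothetical names, and the commented pipeline assumes Docker and containerd are both available on the node.

```shell
REGISTRY=swr.cn-south-1.myhuaweicloud.com/ascendhub

# Print one fully qualified image reference per line.
ascend_images() {
  local ref
  for ref in \
    resilience-controller:v7.1.RC1 \
    ascend-operator:v7.2.RC1 \
    npu-exporter:v7.2.RC1 \
    ascend-k8sdeviceplugin:v7.2.RC1 \
    vc-controller-manager:v1.7.0-v7.2.RC1 \
    vc-scheduler:v1.7.0-v7.2.RC1 \
    noded:v7.2.RC1 \
    clusterd:v7.2.RC1
  do
    printf '%s/%s\n' "$REGISTRY" "$ref"
  done
}

# Example (commented out because it needs Docker and containerd):
# ascend_images | xargs -n 1 docker pull --platform=arm64
# ascend_images | xargs docker save -o ascend.tar
# ctr -n k8s.io images import ascend.tar
```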

Download Deployment Configuration Files

Visit the official Git repository to download the Ascend Device Plugin installation package.

Example: Ascend-mindxdl-device-plugin_7.2.RC1.SPC2_linux-aarch64.zip

After extracting, copy the corresponding YAML files to the Kubernetes management node.

Choose the Appropriate YAML File

Select the corresponding YAML file based on the actual device type and whether to use the Volcano scheduler:

YAML Filename                                     Use Case
device-plugin-310-v{version}.yaml                 Atlas 300I inference card, without Volcano
device-plugin-310-volcano-v{version}.yaml         Atlas 300I inference card, with Volcano
device-plugin-310P-1usoc-v{version}.yaml          Atlas 200I SoC A1 core board, without Volcano
device-plugin-310P-1usoc-volcano-v{version}.yaml  Atlas 200I SoC A1 core board, with Volcano
device-plugin-310P-v{version}.yaml                Atlas inference series products (e.g., 310P), without Volcano
device-plugin-310P-volcano-v{version}.yaml        Atlas inference series products, with Volcano
device-plugin-910-v{version}.yaml                 Atlas training series products/A2/A3/800I A2, without Volcano
device-plugin-volcano-v{version}.yaml             Atlas training series products/A2/A3/800I A2, with Volcano

Note:

  • For Ascend 310P devices, typically choose device-plugin-310P-v{version}.yaml
  • Do not modify the DaemonSet.metadata.name field in the YAML file; changing it can break automatic identification
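The naming scheme in the table is regular except for the training series with Volcano, whose filename carries no chip identifier. The mapping can be sketched as a lookup function. The function name device_plugin_yaml and its argument convention (chip, yes/no for Volcano, version) are hypothetical and only mirror the table above.

```shell
# Usage: device_plugin_yaml <chip> <volcano: yes|no> <version>
# <chip> is one of: 310, 310P, 310P-1usoc, 910
device_plugin_yaml() {
  local chip="$1" volcano="$2" version="$3" stem
  case "$chip:$volcano" in
    910:yes) stem="device-plugin-volcano" ;;   # training series with Volcano: no chip in the name
    910:no)  stem="device-plugin-910" ;;
    310:yes|310P:yes|310P-1usoc:yes) stem="device-plugin-$chip-volcano" ;;
    310:no|310P:no|310P-1usoc:no)    stem="device-plugin-$chip" ;;
    *) echo "unsupported combination: $chip $volcano" >&2; return 1 ;;
  esac
  printf '%s-v%s.yaml\n' "$stem" "$version"
}

# Example:
# device_plugin_yaml 310P no 7.2.RC1.SPC2
```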

Deploy Device Plugin

# Before applying, set the container image in the YAML file to:
#   swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v7.2.RC1
kubectl apply -f device-plugin-310P-v7.2.RC1.SPC2.yaml

Expected output:

serviceaccount/ascend-device-plugin-sa created
clusterrole.rbac.authorization.k8s.io/pods-node-ascend-device-plugin-role created
clusterrolebinding.rbac.authorization.k8s.io/pods-node-ascend-device-plugin-rolebinding created
daemonset.apps/ascend-device-plugin-daemonset created

Verify Deployment

Check if the Device Plugin started successfully:

kubectl get pod -n kube-system | grep ascend

Expected output (status should be Running; the header row is shown for reference only, because grep filters it out):

NAME                                   READY   STATUS    RESTARTS   AGE
ascend-device-plugin-daemonset-d5ctz   1/1     Running   0          11s

Check node NPU resources:

kubectl describe node <node-name> | grep -A 5 "Capacity:"

You should see the huawei.com/Ascend310P resource.
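To read the NPU count directly instead of scanning the output by eye, the describe output can be filtered for the resource name. This is a sketch; the function name npu_capacity is hypothetical, and it prints the first matching entry, which is the Capacity value because that section precedes Allocatable.

```shell
# Read `kubectl describe node <node-name>` output on stdin and print the first
# value reported for the given resource name.
npu_capacity() {
  awk -v key="$1:" '$1 == key { print $2; exit }'
}

# Example (requires cluster access):
# kubectl describe node <node-name> | npu_capacity huawei.com/Ascend310P
```

An empty result means the node does not advertise the resource, which usually points back to the Device Plugin Pod status.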


Using NPU Compute Cards

Pod Configuration Example

Request NPU resources through the resources field in the Pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: npu-test-pod
spec:
  containers:
    - name: alg-container
      image: ubuntu:22.04
      resources:
        limits:
          memory: 24Gi
          huawei.com/Ascend310P: 1  # Request 1 NPU card
        requests:
          memory: 2Gi
          huawei.com/Ascend310P: 1  # Request 1 NPU card
      command: ["/bin/bash", "-c", "sleep infinity"]

Note:

  • The value of huawei.com/Ascend310P indicates the number of NPU cards requested
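  • For huawei.com/Ascend310P, the values in limits and requests must be equal; Kubernetes rejects a Pod whose extended-resource request and limit differ
  • Adjust the resource name to match the actual NPU model (e.g., huawei.com/Ascend310 or huawei.com/Ascend910)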
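
When testing several NPU models, the manifest can be generated from the resource name and card count so that requests and limits always match. The following is a sketch based on the Pod example above; the function name render_npu_pod is hypothetical, and the memory settings are omitted for brevity.

```shell
# Print a test Pod manifest for a given NPU resource name and card count.
# Usage: render_npu_pod <resource-name> <count>
render_npu_pod() {
  local resource="$1" count="$2"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: npu-test-pod
spec:
  containers:
    - name: alg-container
      image: ubuntu:22.04
      command: ["/bin/bash", "-c", "sleep infinity"]
      resources:
        limits:
          ${resource}: ${count}
        requests:
          ${resource}: ${count}
EOF
}

# Example (requires cluster access):
# render_npu_pod huawei.com/Ascend310P 1 | kubectl apply -f -
```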