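Deploying PaddlePaddle on Ascend 310P: Practical Notes

Overview

This article walks through several ways to deploy PaddlePaddle models on the Ascend 310P platform, covering the pros and cons of each option, the steps involved, and fixes for common problems. It mainly targets ASR (automatic speech recognition) and OCR (optical character recognition) workloads.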

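Options at a glance

Option	Description	Status	Rating
Option 1	Install the paddle-custom-npu pip package	❌ Dependency problems	⭐
Option 2	Build PaddleCustomDevice from source	⚠️ Precision issue unresolved	⭐⭐
Option 3	paddle2onnx + onnxruntime_cann	⚠️ Slow inference	⭐⭐⭐
Option 4	paddle2onnx + OM model	⚠️ Dynamic-shape problems	⭐⭐⭐
Option 5	PaddleX high-performance inference	✅ Recommended	⭐⭐⭐⭐⭐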

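Option 1: pip package installation

Overview

This option uses the official paddle-custom-npu pip package for a quick setup. In testing, however, runtime dependency libraries turned out to be missing.

Deployment steps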

# Install base dependencies
pip install psutil attrs decorator

# Install the Python packages the project depends on
pip3 install py3Fdfs imageio pyheif whatimage shapely pyclipper minio \
    scikit-image imgaug lmdb pykafka gunicorn Pillow==9.5.0

# Install PaddlePaddle and the NPU plugin
pip install paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
pip install paddle-custom-npu -i https://www.paddlepaddle.org.cn/packages/nightly/npu/
pip install paddleocr

# Pin protobuf to work around a version incompatibility
pip install protobuf==3.20.0

Configuration changes

# Edit the PaddleSpeech executor configuration (opens at line 92)
vim /root/miniconda3/envs/asr_ocr/lib/python3.9/site-packages/paddlespeech/cli/executor.py +92

# Edit the Paddle core module configuration (opens at line 386)
vim /root/miniconda3/envs/asr_ocr/lib/python3.9/site-packages/paddle/fluid/core.py +386

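Known issues

Runtime dependency libraries are missing, so inference cannot run
After adding the missing libraries, an internal error is still reported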
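
When chasing missing runtime libraries like the ones described above, it can help to ask the dynamic loader directly which shared objects fail to load. The following is a minimal sketch, not part of the original setup: the library names in `CANDIDATE_LIBS` are assumptions chosen for illustration and should be replaced with the names that appear in your own error log.

```python
import ctypes


def find_missing_libs(lib_names):
    """Return the names in lib_names that the dynamic loader cannot open."""
    missing = []
    for name in lib_names:
        try:
            ctypes.CDLL(name)  # raises OSError if the library cannot be loaded
        except OSError:
            missing.append(name)
    return missing


# Hypothetical list for illustration only; use the names from your error log.
CANDIDATE_LIBS = ["libascendcl.so", "libhccl.so"]

if __name__ == "__main__":
    for lib in find_missing_libs(CANDIDATE_LIBS):
        print(f"cannot load: {lib}")
```

A library reported here is either not installed or not on the loader search path (for example because the CANN `set_env.sh` script was not sourced in the current shell).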

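Option 2: build from source

Overview

This option builds and installs PaddleCustomDevice from source. It runs into a precision-support problem on this chip.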

Environment preparation

# Install system dependencies
apt-get update -y && apt-get install -y \
    zlib1g zlib1g-dev libsqlite3-dev openssl libssl-dev libffi-dev \
    libbz2-dev libxslt1-dev unzip pciutils net-tools libblas-dev \
    gfortran libblas3 liblapack-dev liblapack3 libopenblas-dev git

# Install Python dependencies
pip install psutil attrs decorator pyyaml pathlib2 scipy requests \
    absl-py sympy numpy==1.25.0

# Install the CANN kernels and NNAL packages
./Ascend-cann-kernels-310p_8.0.RC2_linux.run --install
./Ascend-cann-nnal_8.0.RC2_linux-aarch64.run --install

# Build and install the operator package
wget -q https://paddle-ascend.bj.bcebos.com/code-share-master.zip --no-check-certificate
. /usr/local/Ascend/ascend-toolkit/set_env.sh
unzip code-share-master.zip
cd code-share-master/build && bash build_ops.sh
chmod +x aie_ops.run && ./aie_ops.run --extract=/usr/local/Ascend/

Environment variables

# Log level (0: debug, 1: info, 2: warning, 3: error, 4: null)
export ASCEND_GLOBAL_LOG_LEVEL=3

# HCCL settings
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_WHITELIST_DISABLE=1
export HCCL_SECURITY_MODE=1
export HCCL_BUFFSIZE=120

# PaddlePaddle NPU settings
export FLAGS_npu_storage_format=0
export FLAGS_use_stride_kernel=0
export FLAGS_allocator_strategy=naive_best_fit
export PADDLE_XCCL_BACKEND=npu
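
If the service is started from Python rather than from a shell that already exported these variables, the same settings can be applied in-process before `paddle` is imported. This is a sketch under that assumption; the helper name `apply_npu_env` is made up for illustration, and the values are copied from the exports above.

```python
import os

# Values copied from the shell exports above.
NPU_ENV = {
    "ASCEND_GLOBAL_LOG_LEVEL": "3",
    "HCCL_CONNECT_TIMEOUT": "7200",
    "HCCL_WHITELIST_DISABLE": "1",
    "HCCL_SECURITY_MODE": "1",
    "HCCL_BUFFSIZE": "120",
    "FLAGS_npu_storage_format": "0",
    "FLAGS_use_stride_kernel": "0",
    "FLAGS_allocator_strategy": "naive_best_fit",
    "PADDLE_XCCL_BACKEND": "npu",
}


def apply_npu_env(env=None, overwrite=False):
    """Write NPU_ENV into env (defaults to os.environ).

    Existing values are kept unless overwrite=True, so settings exported
    by the calling shell take precedence by default.
    """
    target = os.environ if env is None else env
    for key, value in NPU_ENV.items():
        if overwrite or key not in target:
            target[key] = value
    return target
```

Call `apply_npu_env()` at the very top of the entry script, before any `import paddle`, since the flags are typically read when the framework initialises.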

Build and install

# Enter the NPU backend directory
cd PaddleCustomDevice/backends/npu

# Install the CPU build of PaddlePaddle
pip install paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/

# Configure build options
export WITH_TESTING=OFF

# Run the build
bash tools/compile.sh

# Install the resulting wheel
pip install build/dist/paddle_custom_npu*.whl

Verification

# List the available custom hardware backends
python -c "import paddle; print(paddle.device.get_all_custom_device_type())"
# Expected output: ['npu']

# Print version information
python -c "import paddle_custom_device; paddle_custom_device.npu.version()"

# PaddlePaddle health check
python -c "import paddle; paddle.utils.run_check()"
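
The first check can also be wrapped in a small function, which is convenient for a service start-up probe. This is a sketch only: `check_npu_backend` is a hypothetical helper, and it deliberately degrades to "not installed" instead of raising when `paddle` cannot be imported.

```python
import importlib.util


def check_npu_backend():
    """Report whether paddle is importable and whether it sees an 'npu' backend."""
    status = {"paddle_installed": False, "custom_devices": [], "npu_available": False}
    if importlib.util.find_spec("paddle") is None:
        return status
    try:
        import paddle
    except ImportError:
        return status
    status["paddle_installed"] = True
    # Same call as in the one-liner above.
    status["custom_devices"] = list(paddle.device.get_all_custom_device_type() or [])
    status["npu_available"] = "npu" in status["custom_devices"]
    return status


if __name__ == "__main__":
    print(check_npu_backend())
```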

Known issues

Inference fails with: The soc version does not support bf16 / fp32 for calculations, please change the setting of cubeMathType or the Dtype of input tensor.

Option 3: ONNX Runtime with CANN

Overview

Convert the model with paddle2onnx, then run inference through the CANNExecutionProvider of onnxruntime_cann.

Installing dependencies

# Base dependencies
pip install psutil attrs decorator pyyaml pathlib2 scipy requests \
    absl-py sympy numpy==1.25.0 packaging

# Image-processing dependencies
pip install opencv-python Pillow==9.5.0

# Project dependencies
pip install flask py3Fdfs imageio pyheif whatimage shapely pyclipper \
    minio scikit-image imgaug lmdb pykafka gunicorn protobuf==3.20.0 \
    Pyinstaller nacos-python-sdk filetype

Model conversion

# Convert the OCR recognition model
paddle2onnx --model_dir ./ch_PP-OCRv4_rec_infer \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./ch_PP-OCRv4_rec_infer.onnx \
    --opset_version 11 \
    --enable_onnx_checker True

# Convert the OCR detection model
paddle2onnx --model_dir ./ch_PP-OCRv4_det_infer \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./ch_PP-OCRv4_det_infer.onnx \
    --opset_version 11 \
    --enable_onnx_checker True

# Convert the text-direction classification model
paddle2onnx --model_dir ./ch_ppocr_mobile_v2.0_cls_infer \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./ch_ppocr_mobile_v2.0_cls.onnx \
    --opset_version 11 \
    --enable_onnx_checker True
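
The three conversions differ only in the model directory and the output file, so they can be driven from one small script. The sketch below only builds the argument lists, using exactly the flags shown above; the helper names are made up for illustration, and actually running the commands assumes `paddle2onnx` is on the PATH.

```python
def build_paddle2onnx_cmd(model_dir, save_file, opset_version=11):
    """Build the paddle2onnx argument list for one inference model."""
    return [
        "paddle2onnx",
        "--model_dir", model_dir,
        "--model_filename", "inference.pdmodel",
        "--params_filename", "inference.pdiparams",
        "--save_file", save_file,
        "--opset_version", str(opset_version),
        "--enable_onnx_checker", "True",
    ]


# Model directory -> ONNX output file, as in the commands above.
MODELS = {
    "./ch_PP-OCRv4_rec_infer": "./ch_PP-OCRv4_rec_infer.onnx",
    "./ch_PP-OCRv4_det_infer": "./ch_PP-OCRv4_det_infer.onnx",
    "./ch_ppocr_mobile_v2.0_cls_infer": "./ch_ppocr_mobile_v2.0_cls.onnx",
}


def build_all():
    """Return one command per entry in MODELS, in insertion order."""
    return [build_paddle2onnx_cmd(d, f) for d, f in MODELS.items()]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` so that a failed conversion stops the script instead of being silently skipped.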

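Performance tuning

Analysis of NPU inference performance:

Causes of slow inference:

JIT compilation latency: the NPU uses the aclop JIT operator library, so the first run has to compile and cache operators
Operator fallback: unsupported operators fall back to the CPU, which causes frequent memory copies

Mitigations: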

# Disable JIT compilation and use precompiled operators
export FLAGS_npu_jit_compile=0
export FLAGS_use_stride_kernel=0

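Warm-up: run 5-10 warm-up inferences before the measured inference runs.

Operator cache: after a run, a kernel_meta directory containing operator cache files is generated, which speeds up subsequent runs.

Known issues

No clear inference speedup over the CPU
Warm-up has to be done manually
kernel_meta cache files can take up a lot of disk space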
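
The warm-up procedure can be sketched as a small timing helper that discards the first few calls and measures only the rest. This is an illustrative helper, not an API of any of the tools above; `infer_fn` stands for whatever zero-argument callable runs one inference in your code.

```python
import time


def benchmark(infer_fn, warmup=5, iters=20):
    """Call infer_fn `warmup` times untimed, then time `iters` further calls."""
    if iters <= 0:
        raise ValueError("iters must be positive")
    for _ in range(warmup):
        infer_fn()  # results and timings discarded: operators compile here
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        infer_fn()
        timings.append(time.perf_counter() - start)
    return {
        "iters": iters,
        "mean_s": sum(timings) / len(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }
```

For example, `benchmark(lambda: session.run(None, feeds), warmup=10)` would time an ONNX Runtime session, where `session` and `feeds` are assumed to exist in the caller's code.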
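
To keep an eye on how much space the cache takes, the size of the directory can be measured with the standard library alone. A minimal sketch, assuming the cache lives in a directory named `kernel_meta` under the working directory as described above:

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (0 if path does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # skip symlinks to avoid double counting
                total += os.path.getsize(full)
    return total


if __name__ == "__main__":
    size_mb = dir_size_bytes("kernel_meta") / (1024 * 1024)
    print(f"kernel_meta uses {size_mb:.1f} MiB")
```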

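Option 4: OM model deployment

Overview

Convert the ONNX model to OM, the Ascend-specific offline model format, and run inference through the ACL interface.

Model conversion workflow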

# Step 1: convert the Paddle model to ONNX
paddle2onnx --model_dir ./ch_PP-OCRv4_rec_infer \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./ch_PP-OCRv4_rec_infer.onnx \
    --opset_version 11 \
    --enable_onnx_checker True

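# Step 2: convert the ONNX model to OM
# (assumes the ONNX file from Step 1 has been placed under ./rec/;
#  atc appends the .om suffix to --output itself)
atc --model=./rec/ch_PP-OCRv4_rec_infer.onnx \
    --framework=5 \
    --input_format=NCHW \
    --output=./rec/ch_PP-OCRv4_rec_infer \
    --soc_version=Ascend310P3 \
    --input_shape="x:1,3,48,320"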

Technical challenges

Dynamic shapes:

# Example model information
Input name  : x
Input shape : ['DynamicDimension.0', 3, 'DynamicDimension.1', 'DynamicDimension.2']
Output name : sigmoid_0.tmp_0
Output shape: ['DynamicDimension.3', 1, 'DynamicDimension.4', 'DynamicDimension.5']

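Memory allocation: get_output_size_by_index returns 0, so acl.rt.malloc fails to allocate the output buffer.

Workarounds

Automatic allocation: create an empty aclDataBuffer and let the runtime allocate the memory internally
User-estimated allocation: preallocate memory sized for the largest possible output

Known issues

Incomplete dynamic-shape support
Complex memory management
Hard to debug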
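
For the user-estimated approach, the buffer size follows directly from the largest output shape you expect and the element size. The sketch below shows only that arithmetic; the maximum dimensions are assumptions you have to choose for your own inputs, and the ACL allocation call itself is left out.

```python
from math import prod


def estimate_output_bytes(max_shape, itemsize=4):
    """Bytes needed for an output tensor of shape max_shape.

    itemsize is the element size in bytes (4 for float32, 2 for float16).
    """
    if not max_shape or any(d <= 0 for d in max_shape):
        raise ValueError("max_shape must contain only positive dimensions")
    return prod(max_shape) * itemsize


# Hypothetical upper bound for the detection output [N, 1, H, W] shown above:
# batch 1, at most 960 x 960 pixels, float32.
max_bytes = estimate_output_bytes([1, 1, 960, 960])  # 1*1*960*960*4 = 3686400
```

The resulting value would be the size passed to the allocation call in place of the 0 returned by `get_output_size_by_index`. Inputs that produce a larger output than the chosen bound must be rejected or resized beforehand.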

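Option 5: PaddleX high-performance inference (recommended)

Overview

Use the high-performance inference plugin provided by PaddleX. It supports inference with fixed-shape OM models and, in our tests, performed well and ran stably.

Installation and setup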

# Install PaddleX
git clone https://github.com/PaddlePaddle/PaddleX.git
cd PaddleX
pip install -e ".[base]"

# Install the high-performance inference plugin
paddlex --install hpi-npu

Manual build (optional)

cd PaddleX/libs/ultra-infer/python
unset http_proxy https_proxy

# Configure build options
export ENABLE_OM_BACKEND=ON ENABLE_ORT_BACKEND=ON
export ENABLE_PADDLE_BACKEND=OFF WITH_GPU=OFF DEVICE_TYPE=NPU
export NPU_HOST_LIB=/usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/lib64

# Build and install
python setup.py build
python setup.py bdist_wheel
pip install dist/ultra_infer_npu*.whl

Model conversion

paddlex --install paddle2onnx

# Convert to ONNX with PaddleX
paddlex --paddle2onnx \
    --opset_version 11 \
    --paddle_model_dir <Paddle model directory> \
    --onnx_model_dir <ONNX model directory>

# Convert to an OM model (default precision)
atc --model=inference.onnx \
    --framework=5 \
    --output=inference \
    --soc_version=Ascend310P3 \
    --input_shape "x:1,3,48,320"

# Alternative: convert while keeping FP32 precision
# (same --output name, so this overwrites the file produced above)
atc --model=inference.onnx \
    --framework=5 \
    --output=inference \
    --soc_version=Ascend310P3 \
    --input_shape "x:1,3,48,320" \
    --precision_mode_v2=origin
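
Because the two atc variants above differ only in the precision flag, a small command builder makes it easy to produce both without one overwriting the other. This is a sketch: `build_atc_cmd` and the output prefixes are made-up names, and only flags that appear in the commands above are used.

```python
def build_atc_cmd(onnx_path, output_prefix, input_shape,
                  soc_version="Ascend310P3", keep_fp32=False):
    """Build an atc argument list for an ONNX model with a fixed input shape."""
    cmd = [
        "atc",
        f"--model={onnx_path}",
        "--framework=5",                # 5 selects ONNX as the source framework
        f"--output={output_prefix}",    # atc adds the .om suffix itself
        f"--soc_version={soc_version}",
        f"--input_shape={input_shape}",
    ]
    if keep_fp32:
        cmd.append("--precision_mode_v2=origin")
    return cmd


# Hypothetical output prefixes so the two variants do not overwrite each other.
default_cmd = build_atc_cmd("inference.onnx", "inference_default", "x:1,3,48,320")
fp32_cmd = build_atc_cmd("inference.onnx", "inference_fp32", "x:1,3,48,320",
                         keep_fp32=True)
```

Each list can be run with `subprocess.run(cmd, check=True)` in a shell where the CANN environment has been set up.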

Inference code example

# -*- encoding=utf-8 -*-
"""
    @Author: Kang
    @Modified by: 
    @Datetime: 2025/07/15 13:57
    @Description: OCR test script
"""
import os
import time
import copy
from pathlib import Path
from typing import List, Dict, Any

import cv2
import numpy as np
from loguru import logger

from paddlex import create_model


class OCRProcessor:
    """OCR processor: text detection, orientation classification, recognition."""

    def __init__(self,
                 det_model_dir: str = "/opt/models/ocr/PP-OCRv4_server_det_infer_om_310P",
                 rec_model_dir: str = "/opt/models/ocr/PP-OCRv4_server_rec_infer_om_310P",
                 ori_model_dir: str = "/opt/models/ocr/PP-LCNet_x1_0_textline_ori_infer",
                 device: str = "npu:0",
                 output_dir: str = "/opt/output/"):

        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

        # High-performance inference settings: use the OM backend explicitly
        hpi_config = {
            "auto_config": False,
            "backend": "om",
        }

        # Text-line orientation classifier
        # (no device/HPI arguments are passed, so PaddleX uses its defaults)
        logger.info("Loading orientation classification model...")
        self.model_ori = create_model(
            model_name="PP-LCNet_x1_0_textline_ori",
            model_dir=ori_model_dir,
        )

        # Text detection model (fixed-shape OM model)
        logger.info("Loading detection model...")
        self.model_det = create_model(
            model_name="PP-OCRv4_server_det",
            model_dir=det_model_dir,
            device=device,
            use_hpip=True,
            hpi_config=hpi_config,
            input_shape=[3, 640, 480]
        )

        # Text recognition model (fixed-shape OM model)
        logger.info("Loading recognition model...")
        self.model_rec = create_model(
            model_name="PP-OCRv4_server_rec",
            model_dir=rec_model_dir,
            device=device,
            use_hpip=True,
            hpi_config=hpi_config,
            input_shape=[3, 48, 320]
        )
        logger.info("Models loaded")
    
    def _crop_by_polys(self, img: np.ndarray, dt_polys: List[list]) -> List[np.ndarray]:
        """
        Crop the image according to the detection boxes.

        Args:
            img (np.ndarray): input image
            dt_polys (list[list]): list of detection polygons

        Returns:
            list[np.ndarray]: cropped images, one per detection polygon
        """

        dt_boxes = np.array(dt_polys)
        output_list = []
        for bno in range(len(dt_boxes)):
            tmp_box = copy.deepcopy(dt_boxes[bno])
            img_crop = self.get_minarea_rect_crop(img, tmp_box)
            output_list.append(img_crop)

        return output_list

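    def get_minarea_rect_crop(self, img: np.ndarray, points: np.ndarray) -> np.ndarray:
        """
        Crop the minimum-area rectangle that encloses the given points.

        Args:
            img (np.ndarray): input image
            points (np.ndarray): points outlining the region to crop

        Returns:
            np.ndarray: image cropped to the minimum-area rectangle
        """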
        bounding_box = cv2.minAreaRect(np.array(points).astype(np.int32))
        points = sorted(list(cv2.boxPoints(bounding_box)), key=lambda x: x[0])

        index_a, index_b, index_c, index_d = 0, 1, 2, 3
        if points[1][1] > points[0][1]:
            index_a = 0
            index_d = 1
        else:
            index_a = 1
            index_d = 0
        if points[3][1] > points[2][1]:
            index_b = 2
            index_c = 3
        else:
            index_b = 3
            index_c = 2

        box = [points[index_a], points[index_b], points[index_c], points[index_d]]
        crop_img = self.get_rotate_crop_image(img, np.array(box))
        return crop_img

    def _rotate_image(self, image_array_list: List[np.ndarray], rotate_angle_list: List[int]):
        assert len(image_array_list) == len(
            rotate_angle_list
        ), f"Length of image ({len(image_array_list)}) must match length of angle ({len(rotate_angle_list)})"

        for angle in rotate_angle_list:
            assert angle in [0, 1], f"rotate_angle must be 0 or 1, now it's {angle}"

        rotated_images = []
        for image_array, rotate_indicator in zip(image_array_list, rotate_angle_list):
            # Convert 0/1 indicator to actual rotation angle
            rotate_angle = rotate_indicator * 180
            if rotate_angle < 0 or rotate_angle >= 360:
                raise ValueError("`angle` should be in range [0, 360)")

            if rotate_angle < 1e-7:
                rotated_images.append(image_array)
                continue

            # Should we align corners?
            h, w = image_array.shape[:2]
            center = (w / 2, h / 2)
            scale = 1.0
            mat = cv2.getRotationMatrix2D(center, rotate_angle, scale)
            cos = np.abs(mat[0, 0])
            sin = np.abs(mat[0, 1])
            new_w = int((h * sin) + (w * cos))
            new_h = int((h * cos) + (w * sin))
            mat[0, 2] += (new_w - w) / 2
            mat[1, 2] += (new_h - h) / 2
            dst_size = (new_w, new_h)

            rotated = cv2.warpAffine(
                image_array,
                mat,
                dst_size,
                flags=cv2.INTER_CUBIC,
            )
            rotated_images.append(rotated)
        logger.info(f"Number of images after rotation: {len(rotated_images)}")
        return rotated_images

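    def get_rotate_crop_image(self, img: np.ndarray, points: np.ndarray) -> np.ndarray:
        """
        Crop the quadrilateral given by four points and rectify it with a
        perspective transform.

        Args:
            img (np.ndarray): input image array
            points (np.ndarray): four 2D points (float32, shape 4x2) defining the crop region

        Returns:
            np.ndarray: the transformed image array
        """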
        assert len(points) == 4, "shape of points must be 4*2"
        img_crop_width = int(
            max(
                np.linalg.norm(points[0] - points[1]),
                np.linalg.norm(points[2] - points[3]),
            )
        )
        img_crop_height = int(
            max(
                np.linalg.norm(points[0] - points[3]),
                np.linalg.norm(points[1] - points[2]),
            )
        )
        pts_std = np.float32(
            [
                [0, 0],
                [img_crop_width, 0],
                [img_crop_width, img_crop_height],
                [0, img_crop_height],
            ]
        )
        M = cv2.getPerspectiveTransform(points, pts_std)
        dst_img = cv2.warpPerspective(
            img,
            M,
            (img_crop_width, img_crop_height),
            borderMode=cv2.BORDER_REPLICATE,
            flags=cv2.INTER_CUBIC,
        )
        dst_img_height, dst_img_width = dst_img.shape[0:2]
        if dst_img_height * 1.0 / dst_img_width >= 1.5:
            dst_img = np.rot90(dst_img)
        return dst_img

    def process_image(self, image_path: str) -> List[Dict[str, Any]]:
        """
        Run OCR on a single image.

        Args:
            image_path: path to the image

        Returns:
            list of recognition results
        """
        if not os.path.exists(image_path):
            logger.error(f"Image file not found: {image_path}")
            return []

        try:
            # Read the image
            image = cv2.imread(image_path)
            if image is None:
                logger.error(f"Failed to read image: {image_path}")
                return []

            logger.info(f"Processing image: {image_path}")
            start_time = time.time()

            # Text detection
            # predict() may return a lazy generator; list() forces the inference
            # to run inside the timed section.
            logger.info("Starting text detection")
            det_start = time.time()
            output_det = list(self.model_det.predict(image))
            det_time = time.time() - det_start
            logger.info(f"Text detection time: {det_time}s")

            results = []
            result_det = []
            det_polys = []  # stays empty if the detector yields nothing
            for res in output_det:
                det_polys = res.get("dt_polys", [])
                det_scores = res.get("dt_scores", [])
                for idx, det_poly in enumerate(det_polys):
                    result_det.append({
                        "idx": idx,
                        "dt_polys": det_poly,
                        "dt_scores": det_scores[idx],
                    })
                logger.info(f"Detected {len(det_polys)} text regions")
            if len(det_polys) == 0:
                # Nothing to classify or recognise
                return results
            images_det = self._crop_by_polys(image, det_polys)
            for idx, img in enumerate(images_det):
                cv2.imwrite(f"{self.output_dir}/cropped_det_{idx}.png", img)

            # Text-line orientation classification
            logger.info("Starting text orientation classification")
            ori_start = time.time()
            output_ori = list(self.model_ori.predict(images_det))
            ori_time = time.time() - ori_start
            logger.info(f"Orientation classification time: {ori_time}s")
            angles = [
                int(ori_res["class_ids"][0])
                for ori_res in output_ori
            ]
            images_ori = self._rotate_image(images_det, angles)
            for idx, img in enumerate(images_ori):
                cv2.imwrite(f"{self.output_dir}/cropped_ori_{idx}.png", img)

            # Text recognition, one cropped region at a time
            logger.info("Starting text recognition")
            rec_start = time.time()
            for item in result_det:
                output_rec = self.model_rec.predict(images_ori[item["idx"]])
                for rec_res in output_rec:
                    rec_text = rec_res.get("rec_text", "")
                    rec_score = rec_res.get("rec_score", 0)
                    results.append({
                        "idx": item["idx"],
                        "dt_polys": np.asarray(item["dt_polys"]).tolist(),
                        "dt_scores": item["dt_scores"],
                        "rec_res": rec_text,
                        "rec_score": rec_score,
                    })
            rec_time = time.time() - rec_start
            logger.info(f"Text recognition time: {rec_time}s")

            total_time = time.time() - start_time
            logger.info(f"Total processing time: {total_time:.3f}s")

            return results

        except Exception as e:
            logger.exception(f"Failed to process image: {e}")
            return []
    
    def batch_process(self, image_paths: List[str]) -> Dict[str, List[Dict[str, Any]]]:
        """
        Process a batch of images.

        Args:
            image_paths: list of image paths

        Returns:
            mapping from image path to its recognition results
        """
        batch_results = {}
        
        for image_path in image_paths:
            results = self.process_image(image_path)
            batch_results[image_path] = results
            
        return batch_results


def main():
    """Entry point."""
    # Initialise the OCR processor
    ocr_processor = OCRProcessor(
        output_dir="/opt/output/"
    )

    # Process a single image
    image_path = "/opt/test/test.png"
    results = ocr_processor.process_image(image_path)

    # Print the results
    print("\n=== OCR results ===")
    for result in results:
        print(result)


if __name__ == "__main__":
    main()
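
The least obvious step in the script is the corner ordering inside `get_minarea_rect_crop`: the four box corners are sorted by x, then each left/right pair is split by y so the result runs top-left, top-right, bottom-right, bottom-left (image coordinates, y pointing down). The standalone sketch below mirrors those comparisons in plain Python so the logic can be checked without OpenCV; `order_quad_points` is a name introduced here for illustration.

```python
def order_quad_points(points):
    """Order four (x, y) corners as top-left, top-right, bottom-right, bottom-left.

    Mirrors the index_a..index_d logic of get_minarea_rect_crop: sort by x,
    then within the two leftmost and the two rightmost points the one with
    the smaller y is the "top" corner.
    """
    if len(points) != 4:
        raise ValueError("exactly four points are required")
    pts = sorted(points, key=lambda p: p[0])
    if pts[1][1] > pts[0][1]:
        top_left, bottom_left = pts[0], pts[1]
    else:
        top_left, bottom_left = pts[1], pts[0]
    if pts[3][1] > pts[2][1]:
        top_right, bottom_right = pts[2], pts[3]
    else:
        top_right, bottom_right = pts[3], pts[2]
    return [top_left, top_right, bottom_right, bottom_left]


# Axis-aligned 10 x 5 box given in arbitrary order
box = order_quad_points([(10, 0), (0, 0), (0, 5), (10, 5)])
# box == [(0, 0), (10, 0), (10, 5), (0, 5)]
```

This ordering matches the destination corners `pts_std` used by `get_rotate_crop_image`, which is why the perspective transform produces an upright crop.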

Supported models

Model type	Model name	Input shape	Supported chips
Text detection	PP-OCRv4_mobile_det	(1,3,640,480)	910B/310P/310B
Text recognition	PP-OCRv4_mobile_rec	(1,3,48,320)	910B/310P/310B
Image classification	ResNet50	(1,3,224,224)	910B/310P/310B
Object detection	RT-DETR-L	Multiple inputs	910B/310P/310B

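Summary and recommendations

Comparison of options

Option 5 (PaddleX): ✅ strongly recommended; officially supported, with good performance and stable behaviour
Option 3 (ONNX Runtime): ⚠️ usable but performance is mediocre; suitable for quick validation
Option 4 (OM model): ⚠️ technically demanding; suitable for deep customisation
Options 1 and 2: ❌ not recommended; too many unresolved problems

Best practices

Prefer Option 5: use the PaddleX high-performance inference plugin
Fix the input shape: avoid the complexity that dynamic shapes bring
Warm up the model: run warm-up inferences before serving
Monitor performance: use the Ascend CANN profiling tools to find bottlenecks
Manage versions: keep the CANN toolchain and PaddleX versions updated together

Performance tips

Use FP16 precision to speed up inference
Choose a sensible batch size
Take advantage of the operator cache
Periodically clean up kernel_meta temporary files

With the options and fixes described in this article, developers can pick the deployment approach that fits their needs and get PaddlePaddle models running on the Ascend 310P platform.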
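
Fixing the input shape means every image has to be brought to the model's (H, W) before inference. One common way to do that without distorting the content is to scale by the smaller of the two ratios and pad the remainder. The sketch below shows only that arithmetic; it is not a description of what PaddleX does internally, and `fit_to_fixed_shape` is a hypothetical helper.

```python
def fit_to_fixed_shape(src_h, src_w, dst_h, dst_w):
    """Compute an aspect-preserving resize plus bottom/right padding.

    Returns the scale factor, the resized height/width, and how many pixels
    of padding are needed to reach exactly (dst_h, dst_w).
    """
    if min(src_h, src_w, dst_h, dst_w) <= 0:
        raise ValueError("all dimensions must be positive")
    scale = min(dst_h / src_h, dst_w / src_w)
    new_h = min(dst_h, max(1, round(src_h * scale)))
    new_w = min(dst_w, max(1, round(src_w * scale)))
    return {
        "scale": scale,
        "new_h": new_h,
        "new_w": new_w,
        "pad_bottom": dst_h - new_h,
        "pad_right": dst_w - new_w,
    }


# A 1080 x 1920 (H x W) frame fitted to the 640 x 480 detection input
plan = fit_to_fixed_shape(1080, 1920, 640, 480)
# scale 0.25 -> resized to 270 x 480, padded with 370 rows at the bottom
```

Detection polygons produced on the padded image can be mapped back to the original by dividing their coordinates by `scale`.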
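
The clean-up can be scripted. The sketch below looks for directories whose name starts with `kernel_meta` under a given root and removes them; the name prefix is an assumption based on the directory name mentioned in this article, and the function defaults to a dry run so nothing is deleted until you have reviewed the list.

```python
import os
import shutil


def clean_kernel_meta(root, dry_run=True):
    """Find (and optionally delete) kernel_meta* directories under root.

    Returns the sorted list of matching directory paths. With dry_run=True
    nothing is removed.
    """
    matches = []
    for current, dirs, _files in os.walk(root):
        for name in list(dirs):
            if name.startswith("kernel_meta"):
                full = os.path.join(current, name)
                matches.append(full)
                dirs.remove(name)  # do not descend into a directory we handle
                if not dry_run:
                    shutil.rmtree(full, ignore_errors=True)
    return sorted(matches)
```

Removing the cache is safe in the sense that it is regenerated on the next run, but that run pays the operator compilation cost again, so clean-up is best scheduled outside serving hours.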