PaddlePaddle Deployment on Ascend 310P: A Comprehensive Guide

Overview

This article provides a comprehensive guide for deploying PaddlePaddle models on the Ascend 310P platform. It covers multiple deployment approaches, their pros and cons, implementation steps, and solutions to common issues. The guide is particularly useful for ASR (Automatic Speech Recognition) and OCR (Optical Character Recognition) applications.

Deployment Approaches Overview

Approach	Description	Status	Recommendation
Approach 1	Using paddle-custom-npu pip package	❌ Dependency issues	⭐
Approach 2	Compiling PaddleCustomDevice	⚠️ Precision issues	⭐⭐
Approach 3	paddle2onnx + onnxruntime_cann	⚠️ Slow inference	⭐⭐⭐
Approach 4	paddle2onnx + OM model	⚠️ Dynamic shape issues	⭐⭐⭐
Approach 5	PaddleX High-Performance Inference	✅ Recommended	⭐⭐⭐⭐⭐

Approach 1: Pip Package Installation

Overview

This approach uses the official paddle-custom-npu pip package for quick deployment, but testing revealed missing runtime dependencies.

Deployment Steps

# Install basic dependencies
pip install psutil attrs decorator

# Install Python package dependencies
pip3 install py3Fdfs imageio pyheif whatimage shapely pyclipper minio \
    scikit-image imgaug lmdb pykafka gunicorn Pillow==9.5.0

# Install PaddlePaddle and NPU support
pip install paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
pip install paddle-custom-npu -i https://www.paddlepaddle.org.cn/packages/nightly/npu/
pip install paddleocr

# Fix version compatibility issues
pip install protobuf==3.20.0

Configuration Modifications

# Modify PaddleSpeech executor configuration
vim /root/miniconda3/envs/asr_ocr/lib/python3.9/site-packages/paddlespeech/cli/executor.py +92

# Modify Paddle core module configuration
vim /root/miniconda3/envs/asr_ocr/lib/python3.9/site-packages/paddle/fluid/core.py +386

Issues Encountered

Missing runtime dependencies preventing normal inference
Internal errors persist even after adding related dependencies
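
Since the failures here come from missing runtime dependencies, it can help to confirm which Python modules are importable before attempting inference. The following is a minimal sketch using only the standard library. The helper name find_missing_modules is hypothetical, and the module list is an assumption based on the packages installed above. It checks Python modules only, not the shared libraries they load.

```python
import importlib.util
from typing import Iterable, List


def find_missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Import names assumed for the packages installed in this approach
    required = ["paddle", "paddle_custom_device", "paddleocr"]
    print("Missing modules:", find_missing_modules(required))
```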

Approach 2: Compilation Installation

Overview

This approach builds and installs the NPU backend from the PaddleCustomDevice source code, but it runs into precision (data type) support issues on the 310P.

Environment Setup

# System dependencies installation
apt-get update -y && apt-get install -y \
    zlib1g zlib1g-dev libsqlite3-dev openssl libssl-dev libffi-dev \
    libbz2-dev libxslt1-dev unzip pciutils net-tools libblas-dev \
    gfortran libblas3 liblapack-dev liblapack3 libopenblas-dev git

# Python dependencies installation
pip install psutil attrs decorator pyyaml pathlib2 scipy requests \
    absl-py sympy numpy==1.25.0

# CANN toolkit installation
./Ascend-cann-kernels-310p_8.0.RC2_linux.run --install
./Ascend-cann-nnal_8.0.RC2_linux-aarch64.run --install

# Operator package installation
wget -q https://paddle-ascend.bj.bcebos.com/code-share-master.zip --no-check-certificate
. /usr/local/Ascend/ascend-toolkit/set_env.sh
unzip code-share-master.zip
cd code-share-master/build && bash build_ops.sh
chmod +x aie_ops.run && ./aie_ops.run --extract=/usr/local/Ascend/

Environment Variables Configuration

# Log level configuration (0:debug 1:info 2:warning 3:error 4:null)
export ASCEND_GLOBAL_LOG_LEVEL=3

# HCCL configuration
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_WHITELIST_DISABLE=1
export HCCL_SECURITY_MODE=1
export HCCL_BUFFSIZE=120

# PaddlePaddle NPU configuration
export FLAGS_npu_storage_format=0
export FLAGS_use_stride_kernel=0
export FLAGS_allocator_strategy=naive_best_fit
export PADDLE_XCCL_BACKEND=npu
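
When inference is launched from a Python entry point rather than a shell script, the same variables can be applied in code. The sketch below assumes that they need to be in place before the framework is imported, and it leaves any value already exported in the shell untouched. The names NPU_ENV and apply_env_defaults are hypothetical.

```python
import os
from typing import Dict

# Values copied from the shell configuration above
NPU_ENV: Dict[str, str] = {
    "ASCEND_GLOBAL_LOG_LEVEL": "3",
    "HCCL_CONNECT_TIMEOUT": "7200",
    "HCCL_WHITELIST_DISABLE": "1",
    "HCCL_SECURITY_MODE": "1",
    "HCCL_BUFFSIZE": "120",
    "FLAGS_npu_storage_format": "0",
    "FLAGS_use_stride_kernel": "0",
    "FLAGS_allocator_strategy": "naive_best_fit",
    "PADDLE_XCCL_BACKEND": "npu",
}


def apply_env_defaults(env: Dict[str, str]) -> Dict[str, str]:
    """Set each variable unless it is already defined; return the effective values."""
    return {key: os.environ.setdefault(key, value) for key, value in env.items()}


if __name__ == "__main__":
    for key, value in apply_env_defaults(NPU_ENV).items():
        print(f"{key}={value}")
    # import paddle  # import the framework only after the environment is set
```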

Compilation and Installation

# Enter the NPU backend directory (assumes the PaddleCustomDevice repository is already cloned here)
cd PaddleCustomDevice/backends/npu

# Install PaddlePaddle CPU version
pip install paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/

# Configure compilation options
export WITH_TESTING=OFF

# Execute compilation
bash tools/compile.sh

# Install compilation artifacts
pip install build/dist/paddle_custom_npu*.whl

Functionality Verification

# Check available hardware backends
python -c "import paddle; print(paddle.device.get_all_custom_device_type())"
# Expected output: ['npu']

# Check version information
python -c "import paddle_custom_device; paddle_custom_device.npu.version()"

# PaddlePaddle health check
python -c "import paddle; paddle.utils.run_check()"

Issues Encountered

Error: The soc version does not support bf16 / fp32 for calculations, please change the setting of cubeMathType or the Dtype of input tensor.

Approach 3: ONNX Runtime CANN Solution

Overview

This approach converts the models with paddle2onnx and runs inference through the CANNExecutionProvider of onnxruntime_cann.
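
As a rough illustration of the inference side, the sketch below creates an ONNX Runtime session that prefers the CANN provider and falls back to the CPU provider. It assumes an onnxruntime build with CANN support is installed. The helper names pick_providers and create_session are hypothetical, and the model path is whatever ONNX file the conversion step produces.

```python
from typing import List, Sequence

PREFERRED_PROVIDERS = ("CANNExecutionProvider", "CPUExecutionProvider")


def pick_providers(available: Sequence[str]) -> List[str]:
    """Keep the preferred providers that this onnxruntime build actually offers."""
    return [p for p in PREFERRED_PROVIDERS if p in available]


def create_session(onnx_path: str):
    """Create an inference session, preferring the NPU and falling back to CPU."""
    import onnxruntime as ort  # imported lazily so the helper above stays importable

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(onnx_path, providers=providers)
```

After creating a session, calling its get_providers() method shows which providers were actually registered, which is a quick way to notice a silent fallback to CPU.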

Dependencies Installation

# Basic dependencies
pip install psutil attrs decorator pyyaml pathlib2 scipy requests \
    absl-py sympy numpy==1.25.0 packaging

# Image processing dependencies
pip install opencv-python Pillow==9.5.0

# Project dependencies
pip install flask py3Fdfs imageio pyheif whatimage shapely pyclipper \
    minio scikit-image imgaug lmdb pykafka gunicorn protobuf==3.20.0 \
    Pyinstaller nacos-python-sdk filetype

Model Conversion

# OCR recognition model conversion
paddle2onnx --model_dir ./ch_PP-OCRv4_rec_infer \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./ch_PP-OCRv4_rec_infer.onnx \
    --opset_version 11 \
    --enable_onnx_checker True

# OCR detection model conversion
paddle2onnx --model_dir ./ch_PP-OCRv4_det_infer \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./ch_PP-OCRv4_det_infer.onnx \
    --opset_version 11 \
    --enable_onnx_checker True

# Text direction classification model conversion
paddle2onnx --model_dir ./ch_ppocr_mobile_v2.0_cls_infer \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./ch_ppocr_mobile_v2.0_cls.onnx \
    --opset_version 11 \
    --enable_onnx_checker True
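
The three commands above differ only in the model directory and the output file, so they can be generated from a small table. This is a sketch, not part of the original workflow: build_paddle2onnx_cmd and MODELS are hypothetical names, and the flags are the ones used in the commands above.

```python
import shlex
from typing import List


def build_paddle2onnx_cmd(model_dir: str, save_file: str, opset: int = 11) -> List[str]:
    """Build the paddle2onnx argument list used for each model above."""
    return [
        "paddle2onnx",
        "--model_dir", model_dir,
        "--model_filename", "inference.pdmodel",
        "--params_filename", "inference.pdiparams",
        "--save_file", save_file,
        "--opset_version", str(opset),
        "--enable_onnx_checker", "True",
    ]


# Model directory -> ONNX output file, as in the commands above
MODELS = {
    "./ch_PP-OCRv4_rec_infer": "./ch_PP-OCRv4_rec_infer.onnx",
    "./ch_PP-OCRv4_det_infer": "./ch_PP-OCRv4_det_infer.onnx",
    "./ch_ppocr_mobile_v2.0_cls_infer": "./ch_ppocr_mobile_v2.0_cls.onnx",
}

if __name__ == "__main__":
    for model_dir, save_file in MODELS.items():
        cmd = build_paddle2onnx_cmd(model_dir, save_file)
        print(shlex.join(cmd))
        # To run the conversion: subprocess.run(cmd, check=True)
```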

Performance Optimization

NPU inference performance optimization analysis:

Performance Issue Causes:

JIT Compilation Delay: The NPU uses the aclop JIT operator library, so operators must be compiled and cached on the first run
Operator Fallback: Unsupported operators fall back to the CPU, causing frequent memory copying

Optimization Solutions:

# Disable JIT compilation, use pre-compiled operators
export FLAGS_npu_jit_compile=0
export FLAGS_use_stride_kernel=0

Warm-up: Perform 5-10 warm-up runs before timing inference.

Operator Caching Mechanism: After execution, a kernel_meta directory is generated containing operator cache files to improve subsequent execution performance.
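
The warm-up advice can be expressed as a small timing helper that discards the first few calls before measuring. This is a generic sketch using the standard library. The name benchmark is hypothetical, and the workload is a stand-in to be replaced with a real inference call.

```python
import time
from statistics import mean
from typing import Callable, List


def benchmark(run_once: Callable[[], object], warmup: int = 5, runs: int = 20) -> List[float]:
    """Call run_once `warmup` times untimed, then return `runs` timed latencies in seconds."""
    for _ in range(warmup):
        run_once()  # early calls absorb operator compilation and cache creation
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    # Stand-in workload; replace with a real inference call
    latencies = benchmark(lambda: sum(range(10_000)))
    print(f"mean latency: {mean(latencies) * 1000:.3f} ms over {len(latencies)} runs")
```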

Issues Encountered

No significant inference speed improvement compared to CPU
Manual warm-up operations required
kernel_meta cache files may consume significant disk space
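
To keep an eye on the disk usage mentioned above, the size of the cache directory can be measured with a few lines of standard-library code. The sketch assumes kernel_meta sits in the current working directory. The helper name directory_size_bytes is hypothetical.

```python
import os
from pathlib import Path


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = Path(dirpath) / filename
            # Skip symlinks so linked files are not counted
            if path.is_file() and not path.is_symlink():
                total += path.stat().st_size
    return total


if __name__ == "__main__":
    # kernel_meta is assumed to be in the current working directory
    size_mib = directory_size_bytes("kernel_meta") / (1024 * 1024)
    print(f"kernel_meta uses {size_mib:.1f} MiB")
```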

Approach 4: OM Model Deployment

Overview

This approach converts the ONNX models to the Ascend-specific OM format and runs inference through the ACL interfaces.

Model Conversion Process

# Step 1: Paddle model to ONNX
paddle2onnx --model_dir ./ch_PP-OCRv4_rec_infer \
    --model_filename inference.pdmodel \
    --params_filename inference.pdiparams \
    --save_file ./ch_PP-OCRv4_rec_infer.onnx \
    --opset_version 11 \
    --enable_onnx_checker True

# Step 2: ONNX model to OM (atc appends the .om suffix to the --output name)
atc --model=./ch_PP-OCRv4_rec_infer.onnx \
    --framework=5 \
    --input_format=NCHW \
    --output=./ch_PP-OCRv4_rec_infer \
    --soc_version=Ascend310P3 \
    --input_shape="x:1,3,48,320"
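
When several models need fixed-shape conversion, the atc arguments can be assembled programmatically. The sketch below only builds the argument list. The names format_input_shape and build_atc_cmd are hypothetical, and the options are limited to the ones used in the command above.

```python
from typing import List, Sequence


def format_input_shape(name: str, shape: Sequence[int]) -> str:
    """Format a fixed shape as used with --input_shape, e.g. x:1,3,48,320."""
    return f"{name}:{','.join(str(dim) for dim in shape)}"


def build_atc_cmd(onnx_path: str, output_prefix: str, shape: Sequence[int],
                  input_name: str = "x", soc_version: str = "Ascend310P3") -> List[str]:
    """Build the atc argument list for an ONNX model (framework=5) with a fixed input shape."""
    return [
        "atc",
        f"--model={onnx_path}",
        "--framework=5",
        "--input_format=NCHW",
        f"--output={output_prefix}",
        f"--soc_version={soc_version}",
        f"--input_shape={format_input_shape(input_name, shape)}",
    ]


if __name__ == "__main__":
    cmd = build_atc_cmd("./ch_PP-OCRv4_rec_infer.onnx", "./ch_PP-OCRv4_rec_infer", (1, 3, 48, 320))
    print(" ".join(cmd))
    # To run the conversion: subprocess.run(cmd, check=True)
```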

Technical Challenges

Dynamic Shape Issues:

# Model information example
Input name  : x
Input shape : ['DynamicDimension.0', 3, 'DynamicDimension.1', 'DynamicDimension.2']
Output name : sigmoid_0.tmp_0  
Output shape: ['DynamicDimension.3', 1, 'DynamicDimension.4', 'DynamicDimension.5']

Memory Allocation Issues: get_output_size_by_index returns 0, causing acl.rt.malloc memory allocation failure.

Solutions

System Auto-allocation: Create an empty aclDataBuffer and let the system allocate the output memory automatically
User Pre-allocation: Pre-allocate memory sized for the largest possible output
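
For the pre-allocation option, the buffer size has to be derived from an upper bound on each dynamic dimension. The sketch below detects dynamic dimensions in a shape like the one printed above and computes the byte count for assumed bounds. The helper names and the bounds (batch 1, a 640x480 map, 4-byte float32 elements) are illustrative assumptions, and no ACL calls are made.

```python
from math import prod
from typing import List, Sequence, Union

Dim = Union[int, str]


def dynamic_axes(shape: Sequence[Dim]) -> List[int]:
    """Return the indices of dimensions that are not fixed integers."""
    return [i for i, dim in enumerate(shape) if not isinstance(dim, int)]


def max_buffer_bytes(shape: Sequence[Dim], max_dims: Sequence[int], itemsize: int = 4) -> int:
    """Size a buffer for the largest output: dynamic dims take their bound from max_dims."""
    resolved = [dim if isinstance(dim, int) else bound for dim, bound in zip(shape, max_dims)]
    return prod(resolved) * itemsize


if __name__ == "__main__":
    output_shape = ["DynamicDimension.3", 1, "DynamicDimension.4", "DynamicDimension.5"]
    print(dynamic_axes(output_shape))  # [0, 2, 3]
    # Upper bounds are assumptions: batch 1 and a 640x480 map of float32 values
    print(max_buffer_bytes(output_shape, max_dims=[1, 1, 640, 480]))  # 1228800
```

The resulting byte count is the size that would be requested from the allocator instead of the 0 returned by get_output_size_by_index.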

Issues Encountered

Incomplete dynamic shape support
Complex memory management
Difficult debugging

Approach 5: PaddleX High-Performance Inference (Recommended)

Overview

This approach uses PaddleX’s high-performance inference plugin, which supports fixed-shape OM model inference and delivers excellent performance and stability.

Installation and Configuration

# Install PaddleX
git clone https://github.com/PaddlePaddle/PaddleX.git
cd PaddleX
pip install -e ".[base]"

# Install high-performance inference plugin
paddlex --install hpi-npu

Manual Compilation (Optional)

cd PaddleX/libs/ultra-infer/python
unset http_proxy https_proxy

# Configure compilation options
export ENABLE_OM_BACKEND=ON ENABLE_ORT_BACKEND=ON 
export ENABLE_PADDLE_BACKEND=OFF WITH_GPU=OFF DEVICE_TYPE=NPU
export NPU_HOST_LIB=/usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/lib64

# Compile and install
python setup.py build
python setup.py bdist_wheel
pip install dist/ultra_infer_npu*.whl

Model Conversion

paddlex --install paddle2onnx

# Convert to ONNX using PaddleX
paddlex --paddle2onnx \
    --opset_version 11 \
    --paddle_model_dir <PaddlePaddle_model_directory> \
    --onnx_model_dir <ONNX_model_directory>

# Convert to OM model
atc --model=inference.onnx \
    --framework=5 \
    --output=inference \
    --soc_version=Ascend310P3 \
    --input_shape="x:1,3,48,320"

# Alternative: convert while keeping the model's original (FP32) precision.
# Both commands write to the same output name, so run only one of them.
atc --model=inference.onnx \
    --framework=5 \
    --output=inference \
    --soc_version=Ascend310P3 \
    --input_shape="x:1,3,48,320" \
    --precision_mode_v2=origin
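
A fixed shape such as 1,3,48,320 means every text crop must end up exactly 48 pixels high and 320 pixels wide. PaddleX performs its own preprocessing, so the sketch below is only an illustration of what that constraint implies: scale to the target height, cap the width, and pad the remainder. The helper name fit_width is hypothetical.

```python
from typing import Tuple


def fit_width(src_h: int, src_w: int, target_h: int = 48, target_w: int = 320) -> Tuple[int, int]:
    """Return (resized_width, right_padding) for an image scaled to target_h.

    The aspect ratio is kept; widths beyond target_w are squeezed to fit.
    """
    if src_h <= 0 or src_w <= 0:
        raise ValueError("image dimensions must be positive")
    # Ceiling division keeps at least one pixel of width
    scaled_w = -(-target_h * src_w // src_h)
    resized_w = min(scaled_w, target_w)
    return resized_w, target_w - resized_w


if __name__ == "__main__":
    print(fit_width(24, 100))   # (200, 120)
    print(fit_width(48, 1000))  # (320, 0)
```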

Inference Code Example

# -*- encoding=utf-8 -*-
"""
    @Author: Kang
    @Modified by: 
    @Datetime: 2025/07/15 13:57
    @Description: OCR test script
"""
import os
import time
import copy
from pathlib import Path
from typing import List, Dict, Any

import cv2
import numpy as np
from loguru import logger

from paddlex import create_model


class OCRProcessor:
    """OCR processor class"""
    
    def __init__(self, 
                 det_model_dir: str = "/opt/models/ocr/PP-OCRv4_server_det_infer_om_310P",
                 rec_model_dir: str = "/opt/models/ocr/PP-OCRv4_server_rec_infer_om_310P",
                 ori_model_dir: str = "/opt/models/ocr/PP-LCNet_x1_0_textline_ori_infer",
                 device: str = "npu:0",
                 output_dir: str = "/opt/output/"):
        
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        # Configure parameters
        hpi_config = {
            "auto_config": False,
            "backend": "om",
        }
        
        # Initialize classification model
        logger.info("Loading classification model...")
        self.model_ori = create_model(
            model_name="PP-LCNet_x1_0_textline_ori",
            model_dir=ori_model_dir,
        )

        # Initialize detection model
        logger.info("Loading detection model...")
        self.model_det = create_model(
            model_name="PP-OCRv4_server_det",
            model_dir=det_model_dir,
            device=device,
            use_hpip=True,
            hpi_config=hpi_config,
            input_shape=[3, 640, 480]
        )
        
        # Initialize recognition model
        logger.info("Loading recognition model...")
        self.model_rec = create_model(
            model_name="PP-OCRv4_server_rec",
            model_dir=rec_model_dir,
            device=device,
            use_hpip=True,
            hpi_config=hpi_config,
            input_shape=[3, 48, 320]
        )
        logger.info("Model loaded successfully")
    
    def _crop_by_polys(self, img: np.ndarray, dt_polys: List[list]) -> List[np.ndarray]:
        """
        Crop the image regions described by the detection polygons.

        Args:
            img (np.ndarray): The input image.
            dt_polys (list[list]): List of detection polygons.

        Returns:
            list[np.ndarray]: One cropped image per detection polygon.
        """

        dt_boxes = np.array(dt_polys)
        output_list = []
        for bno in range(len(dt_boxes)):
            tmp_box = copy.deepcopy(dt_boxes[bno])
            img_crop = self.get_minarea_rect_crop(img, tmp_box)
            output_list.append(img_crop)

        return output_list

    def get_minarea_rect_crop(self, img: np.ndarray, points: np.ndarray) -> np.ndarray:
        """
        Get the minimum area rectangle crop from the given image and points.

        Args:
            img (np.ndarray): The input image.
            points (np.ndarray): A list of points defining the shape to be cropped.

        Returns:
            np.ndarray: The cropped image with the minimum area rectangle.
        """
        bounding_box = cv2.minAreaRect(np.array(points).astype(np.int32))
        points = sorted(list(cv2.boxPoints(bounding_box)), key=lambda x: x[0])

        index_a, index_b, index_c, index_d = 0, 1, 2, 3
        if points[1][1] > points[0][1]:
            index_a = 0
            index_d = 1
        else:
            index_a = 1
            index_d = 0
        if points[3][1] > points[2][1]:
            index_b = 2
            index_c = 3
        else:
            index_b = 3
            index_c = 2

        box = [points[index_a], points[index_b], points[index_c], points[index_d]]
        crop_img = self.get_rotate_crop_image(img, np.array(box))
        return crop_img

    def _rotate_image(self, image_array_list: List[np.ndarray], rotate_angle_list: List[int]):
        assert len(image_array_list) == len(
            rotate_angle_list
        ), f"Length of image ({len(image_array_list)}) must match length of angle ({len(rotate_angle_list)})"

        for angle in rotate_angle_list:
            assert angle in [0, 1], f"rotate_angle must be 0 or 1, now it's {angle}"

        rotated_images = []
        for image_array, rotate_indicator in zip(image_array_list, rotate_angle_list):
            # Convert 0/1 indicator to actual rotation angle
            rotate_angle = rotate_indicator * 180
            if rotate_angle < 0 or rotate_angle >= 360:
                raise ValueError("`angle` should be in range [0, 360)")

            if rotate_angle < 1e-7:
                rotated_images.append(image_array)
                continue

            # Should we align corners?
            h, w = image_array.shape[:2]
            center = (w / 2, h / 2)
            scale = 1.0
            mat = cv2.getRotationMatrix2D(center, rotate_angle, scale)
            cos = np.abs(mat[0, 0])
            sin = np.abs(mat[0, 1])
            new_w = int((h * sin) + (w * cos))
            new_h = int((h * cos) + (w * sin))
            mat[0, 2] += (new_w - w) / 2
            mat[1, 2] += (new_h - h) / 2
            dst_size = (new_w, new_h)

            rotated = cv2.warpAffine(
                image_array,
                mat,
                dst_size,
                flags=cv2.INTER_CUBIC,
            )
            rotated_images.append(rotated)
        logger.info(f"Number of rotated images: {len(rotated_images)}")
        return rotated_images

    def get_rotate_crop_image(self, img: np.ndarray, points: list) -> np.ndarray:
        """
        Crop and rotate the input image based on the given four points to form a perspective-transformed image.

        Args:
            img (np.ndarray): The input image array.
            points (list): A list of four 2D points defining the crop region in the image.

        Returns:
            np.ndarray: The transformed image array.
        """
        assert len(points) == 4, "shape of points must be 4*2"
        img_crop_width = int(
            max(
                np.linalg.norm(points[0] - points[1]),
                np.linalg.norm(points[2] - points[3]),
            )
        )
        img_crop_height = int(
            max(
                np.linalg.norm(points[0] - points[3]),
                np.linalg.norm(points[1] - points[2]),
            )
        )
        pts_std = np.float32(
            [
                [0, 0],
                [img_crop_width, 0],
                [img_crop_width, img_crop_height],
                [0, img_crop_height],
            ]
        )
        M = cv2.getPerspectiveTransform(points, pts_std)
        dst_img = cv2.warpPerspective(
            img,
            M,
            (img_crop_width, img_crop_height),
            borderMode=cv2.BORDER_REPLICATE,
            flags=cv2.INTER_CUBIC,
        )
        dst_img_height, dst_img_width = dst_img.shape[0:2]
        if dst_img_height * 1.0 / dst_img_width >= 1.5:
            dst_img = np.rot90(dst_img)
        return dst_img

    def process_image(self, image_path: str) -> List[Dict[str, Any]]:
        """
        Process single image for OCR recognition
        
        Args:
            image_path: Image path
            
        Returns:
            List of recognition results
        """
        if not os.path.exists(image_path):
            logger.error(f"Image file does not exist: {image_path}")
            return []
            
        try:
            # Read image
            image = cv2.imread(image_path)
            if image is None:
                logger.error(f"Failed to read image: {image_path}")
                return []
                
            logger.info(f"Start processing image: {image_path}")
            start_time = time.time()
            
            # Text detection
            logger.info(f"Start text detection")
            det_start = time.time()
            output_det = self.model_det.predict(image)
            det_time = time.time() - det_start
            logger.info(f"Text detection time: {det_time}s")
            
            results = []
            result_det = []
            det_polys = []  # stays empty if the detector yields no result
            for res in output_det:
                det_polys = res.get("dt_polys", [])
                det_scores = res.get("dt_scores", [])
                for idx, det_poly in enumerate(det_polys):
                    result_det.append({
                        "idx": idx,
                        "dt_polys": det_poly,
                        "dt_scores": det_scores[idx],
                    })
                logger.info(f"Detected {len(det_polys)} text regions")
            # Nothing to classify or recognize if no text was detected
            if len(det_polys) == 0:
                return results
            images_det = self._crop_by_polys(image, det_polys)
            for idx, img in enumerate(images_det):
                cv2.imwrite(f"{self.output_dir}/cropped_det_{idx}.png", img)
                    
            # Text direction classification
            logger.info(f"Start text direction classification")
            ori_start = time.time()
            output_ori = self.model_ori.predict(images_det)
            ori_time = time.time() - ori_start
            logger.info(f"Text direction classification time: {ori_time}s")
            angles = [
                int(ori_res["class_ids"][0])
                for ori_res in output_ori
            ]
            images_ori = self._rotate_image(images_det, angles)
            for idx, img in enumerate(images_ori):
                cv2.imwrite(f"{self.output_dir}/cropped_ori_{idx}.png", img)
            
            # Text recognition
            logger.info(f"Start text recognition")
            rec_start = time.time()
            for item in result_det:
                output_rec = self.model_rec.predict(images_ori[item["idx"]])
                for rec_res in output_rec:
                    rec_text = rec_res.get("rec_text", "")
                    rec_score = rec_res.get("rec_score", 0)
                    results.append({
                        "idx": item["idx"],
                        "dt_polys": item["dt_polys"].tolist(),
                        "dt_scores": item["dt_scores"],
                        "rec_res": rec_text,
                        "rec_score": rec_score,
                    })
            rec_time = time.time() - rec_start
            logger.info(f"Text recognition time: {rec_time}s")

            total_time = time.time() - start_time
            logger.info(f"Total processing time: {total_time:.3f}s")
            
            return results
            
        except Exception as e:
            logger.exception(f"Failed to process image: {e}")
            return []
    
    def batch_process(self, image_paths: List[str]) -> Dict[str, List[Dict[str, Any]]]:
        """
        Batch process images
        
        Args:
            image_paths: List of image paths
            
        Returns:
            Batch processing results
        """
        batch_results = {}
        
        for image_path in image_paths:
            results = self.process_image(image_path)
            batch_results[image_path] = results
            
        return batch_results


def main():
    """Main function"""
    # Initialize OCR processor
    ocr_processor = OCRProcessor(
        output_dir="/opt/output/"
    )
    
    # Process single image
    image_path = "/opt/test/test.png"
    results = ocr_processor.process_image(image_path)
    
    # Output results
    print("\n=== OCR results ===")
    for result in results:
        print(result)


if __name__ == "__main__":
    main()
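
The result list returned by process_image follows detection order, which is not necessarily reading order. As a possible post-processing step, the sketch below sorts results top to bottom and then left to right, using the dt_polys and rec_res fields produced above. The helper name sort_reading_order and the line_tolerance parameter are hypothetical, and the banding rule is a simplification rather than a layout analysis.

```python
from typing import Any, Dict, List


def sort_reading_order(results: List[Dict[str, Any]], line_tolerance: int = 10) -> List[Dict[str, Any]]:
    """Sort OCR results top-to-bottom, then left-to-right within a line.

    Boxes whose top edges fall in the same `line_tolerance`-pixel band are
    treated as one line.
    """
    def key(item: Dict[str, Any]):
        xs = [point[0] for point in item["dt_polys"]]
        ys = [point[1] for point in item["dt_polys"]]
        return (min(ys) // line_tolerance, min(xs))

    return sorted(results, key=key)


if __name__ == "__main__":
    sample = [
        {"dt_polys": [[200, 12], [300, 12], [300, 40], [200, 40]], "rec_res": "world"},
        {"dt_polys": [[10, 60], [120, 60], [120, 90], [10, 90]], "rec_res": "next line"},
        {"dt_polys": [[10, 15], [120, 15], [120, 42], [10, 42]], "rec_res": "hello"},
    ]
    print(" | ".join(item["rec_res"] for item in sort_reading_order(sample)))
```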

Supported Models

Model Type	Model Name	Input Shape	Chip Support
Text Detection	PP-OCRv4_mobile_det	(1,3,640,480)	910B/310P/310B
Text Recognition	PP-OCRv4_mobile_rec	(1,3,48,320)	910B/310P/310B
Image Classification	ResNet50	(1,3,224,224)	910B/310P/310B
Object Detection	RT-DETR-L	Multi-input	910B/310P/310B

Summary and Recommendations

Approach Comparison

Approach 5 (PaddleX): ✅ Highly Recommended - Official support, excellent performance, stable and reliable
Approach 3 (ONNX Runtime): ⚠️ Usable but average performance, suitable for quick validation
Approach 4 (OM Model): ⚠️ High technical difficulty, suitable for deep customization
Approaches 1 & 2: ❌ Not recommended due to multiple issues

Best Practices

Prioritize Approach 5: Use PaddleX high-performance inference plugin
Fixed Input Shapes: Avoid complexity introduced by dynamic shapes
Model Warm-up: Perform warm-up operations before inference
Performance Monitoring: Use Ascend CANN Profiling tools for performance optimization
Version Management: Keep CANN toolkit and PaddleX versions synchronized
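
For the version management point, recording the installed package versions alongside each deployment makes mismatches easier to spot. The sketch below covers Python packages only; the CANN toolkit version has to be checked separately. The helper name installed_versions is hypothetical, and the distribution names are taken from the pip commands in this guide.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names taken from the pip commands in this guide
    for name, version in installed_versions(
        ["paddlex", "paddlepaddle", "paddle-custom-npu", "paddle2onnx"]
    ).items():
        print(f"{name}: {version or 'not installed'}")
```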

Performance Optimization Recommendations

Use FP16 precision to improve inference speed
Configure batch size appropriately
Utilize operator caching mechanisms
Regularly clean up kernel_meta temporary files

With this guide, developers can choose the deployment approach that fits their requirements and use the corresponding troubleshooting notes to deploy PaddlePaddle models on the Ascend 310P platform.