Design and Optimization of GEMM for Complex Numbers on Ascend NPU
Erkun Zhang, Yu Zhang, Pengxiang Xu, Lu Lu
General Matrix Multiplication (GEMM) is a foundational kernel across numerous application domains. The mathematical properties of complex numbers make them widely used in engineering computing scenarios such as signal processing and signal transformation. This study investigates high-efficiency CGEMM, that is, complex-valued GEMM, for NPU hardware, broadening the application scope of NPUs beyond mainstream low-precision AI workloads. The major contributions of this study are as follows: (i) the numerical precision and hardware utilization of the 3M and 4M decomposition schemes on Ascend NPUs are analyzed, and the 4M method is selected as the preferred CGEMM implementation under our tested hardware constraints, as it suits the bandwidth limitations of modern accelerators in both precision-sensitive and performance-critical matrix computation scenarios; (ii) a complete high-performance CGEMM design based on the 4M scheme and tailored for Ascend NPUs is proposed, with an AIC/AIV dual-stream pipeline scheduling strategy that coordinates padding operations, matrix–matrix multiplications, and element-wise instructions across multi-level memory hierarchies and compute units; (iii) a fine-grained task scheduling and assignment mechanism is implemented to maximize Cube core occupancy across diverse matrix dimensions, improving hardware utilization for a range of workloads. Our experimental measurements show that the proposed CGEMM achieves a hardware utilization rate of 83.6% across all tested matrix configurations.
We also measure an average speedup of 1.14× over the AscendSipBoost implementation on an identical Ascend NPU, and a 3.17× speedup over cuBLAS on the NVIDIA GPU platform used in our experiments, across all evaluated matrix sizes. These results indicate that Ascend NPUs are a promising platform for high-precision complex-valued computing workloads within the tested experimental setup.
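For readers unfamiliar with the 3M and 4M schemes named above, the following is a minimal sketch of the two textbook decompositions, with the real and imaginary parts of each matrix stored separately. It is not the NPU kernel described in this work: the function names (`matmul`, `cgemm_4m`, `cgemm_3m`) and the plain-list matrix representation are assumptions made for illustration only. Writing A = Ar + i·Ai and B = Br + i·Bi, the 4M scheme uses four real products, while the 3M scheme uses three real products plus extra additions.

```python
# Illustrative sketch only: textbook 4M and 3M decompositions of C = A * B.
# Matrices are plain lists of rows; real and imaginary parts are kept separate.

def matmul(a, b):
    """Real-valued product of a (m x k) and b (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    """Element-wise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def cgemm_4m(ar, ai, br, bi):
    """4M: four real products.
    Cr = Ar*Br - Ai*Bi,  Ci = Ar*Bi + Ai*Br."""
    cr = sub(matmul(ar, br), matmul(ai, bi))
    ci = add(matmul(ar, bi), matmul(ai, br))
    return cr, ci

def cgemm_3m(ar, ai, br, bi):
    """3M: three real products.
    T1 = Ar*Br, T2 = Ai*Bi, T3 = (Ar + Ai)*(Br + Bi),
    Cr = T1 - T2,  Ci = T3 - T1 - T2."""
    t1 = matmul(ar, br)
    t2 = matmul(ai, bi)
    t3 = matmul(add(ar, ai), add(br, bi))
    cr = sub(t1, t2)
    ci = sub(sub(t3, t1), t2)
    return cr, ci

# Small integer example, so both schemes agree exactly.
ar, ai = [[1, 3], [0, 2]], [[2, -1], [1, 0]]
br, bi = [[2, 0], [1, 4]], [[-1, 1], [1, 0]]
print(cgemm_4m(ar, ai, br, bi))
print(cgemm_3m(ar, ai, br, bi))
```

With integer inputs the two schemes give identical results. In floating point, the imaginary part of the 3M result is formed by subtracting T1 and T2 from T3, which can amplify rounding error when the terms are of similar magnitude; this is one commonly cited reason to weigh precision against the saved multiplication when choosing between the schemes.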