Unleashing Triton on CPUs: Compilation and Runtime Co-Optimization for Scalable Vector Architectures


DOI: 10.3390/computers15070406 ISSN: 2073-431X


Jianan Li, Xiaonan Chai, Wei Gao

While the Triton compiler has revolutionized GPU kernel development, its deployment on general-purpose CPUs struggles to fully utilize the underlying hardware capabilities. This is primarily due to the semantic gap between Triton’s SPMD execution model and CPU vector architectures, which leads to suboptimal utilization of vector units during complex memory accesses. In this paper, we present a comprehensive compilation and runtime co-optimization framework for Triton-CPU, specifically targeting Vector Length Agnostic (VLA) architectures such as ARM SVE. At the compiler level, we propose a novel semantic reconstruction and explicit base-offset decoupling strategy, which enables native VLA gather/scatter generation and eliminates scalar loop overheads. At the runtime level, we introduce a machine learning-driven thread scheduling model to orchestrate the synergy between Thread-Level Parallelism and Vector-Level Parallelism. Extensive evaluations on an ARM-based multi-core processor demonstrate that our framework achieves up to a 2.0× throughput improvement for compute-bound GEMM operators (peaking at 346 GFLOPS), outperforming the hand-optimized OpenBLAS library by up to 1.54× at small-to-medium scales. It also delivers a 1.7× speedup for element-wise workloads. Furthermore, our optimizations saturate memory bandwidth (up to 55 GB/s) for memory-bound operators with zero compilation bloat, establishing a robust, high-performance foundation for deploying deep learning models on general-purpose CPUs.
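
The base-offset decoupling idea mentioned in the abstract can be illustrated with a small model. The following is only a sketch under stated assumptions, not the framework's implementation: memory is modeled as a flat Python list, addresses are list indices, and the function names (`fused_pointers`, `gather_scalar_loop`, `gather_base_offset`) and the `vector_length` parameter are hypothetical. It contrasts a fused pointer vector consumed by a scalar loop with a decoupled (scalar base, offset vector) pair, which is the operand shape a hardware gather of the "scalar base plus vector of offsets" kind expects.

```python
# Illustrative model only: "memory" is a flat list, addresses are indices.
# All names here are hypothetical and do not come from the paper.

def fused_pointers(base, idx):
    """Fused form: base and indices folded into absolute addresses."""
    return [base + i for i in idx]


def gather_scalar_loop(memory, addresses):
    """Fallback form: one scalar load per absolute address."""
    out = []
    for addr in addresses:
        out.append(memory[addr])
    return out


def gather_base_offset(memory, base, offsets, vector_length):
    """Decoupled form: a scalar base plus a vector of offsets.

    Offsets are consumed in chunks of `vector_length` lanes. Each chunk
    stands in for one gather instruction; the last chunk may be partial,
    which a predicate mask would cover on real vector hardware.
    """
    out = []
    for start in range(0, len(offsets), vector_length):
        lanes = offsets[start:start + vector_length]
        out.extend(memory[base + off] for off in lanes)
    return out


memory = list(range(100, 132))   # 32 elements: 100, 101, ..., 131
base = 8
idx = [3, 0, 7, 1, 12, 5, 2]

reference = gather_scalar_loop(memory, fused_pointers(base, idx))

# The result does not depend on the assumed vector length.
for vl in (2, 4, 8):
    assert gather_base_offset(memory, base, idx, vl) == reference
```

The loop over several values of `vector_length` mirrors the vector-length-agnostic property: the same code yields the same result whatever lane count the hardware provides.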
