Unleashing Triton on CPUs: Compilation and Runtime Co-Optimization for Scalable Vector Architectures
Jianan Li, Xiaonan Chai, Wei Gao

While the Triton compiler has revolutionized GPU kernel development, its deployment on general-purpose CPUs struggles to fully utilize the underlying hardware. This is primarily due to the semantic gap between Triton’s SPMD execution model and CPU vector architectures, which leaves vector units underutilized during complex memory accesses. In this paper, we present a comprehensive compilation and runtime co-optimization framework for Triton-CPU, specifically targeting Vector Length Agnostic (VLA) architectures such as ARM SVE. At the compiler level, we propose a novel semantic reconstruction and explicit base-offset decoupling strategy that enables native VLA gather/scatter generation and eliminates scalar loop overheads. At the runtime level, we introduce a machine-learning-driven thread scheduling model that orchestrates the interplay between Thread-Level Parallelism and Vector-Level Parallelism. Extensive evaluations on an ARM-based multi-core processor demonstrate that our framework achieves up to a 2.0× throughput improvement for compute-bound GEMM operators (peaking at 346 GFLOPS), outperforming the hand-optimized OpenBLAS library by up to 1.54× at small-to-medium scales. It also delivers a 1.7× speedup for element-wise workloads. Furthermore, our optimizations saturate memory bandwidth (up to 55 GB/s) for memory-bound operators with zero compilation bloat, establishing a robust, high-performance foundation for deploying deep learning models on general-purpose CPUs.
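
To make the idea of base-offset decoupling concrete, the following is a minimal illustrative sketch in plain Python, not the compiler pass described in this paper. It contrasts a scalar loop over fully materialized addresses with a form that keeps one scalar base and a vector of offsets, consumed in chunks whose width is only known at run time (as on a VLA machine). The function names `gather_scalar_loop` and `gather_base_offset`, the `vector_length` parameter, and the list-based model of memory are all assumptions introduced for illustration.

```python
from typing import List, Sequence


def gather_scalar_loop(memory: Sequence[float], pointers: Sequence[int]) -> List[float]:
    """Baseline: one scalar load per fully materialized address."""
    return [memory[p] for p in pointers]


def gather_base_offset(
    memory: Sequence[float],
    base: int,
    offsets: Sequence[int],
    vector_length: int,
) -> List[float]:
    """Hypothetical decoupled form: scalar base plus a vector of offsets.

    `vector_length` stands in for the hardware vector width, which a
    vector-length-agnostic program only learns at run time.
    """
    if vector_length <= 0:
        raise ValueError("vector_length must be positive")
    out: List[float] = []
    for start in range(0, len(offsets), vector_length):
        # Tail predicate: lanes past the end of `offsets` stay inactive.
        active = min(vector_length, len(offsets) - start)
        lanes = offsets[start:start + active]
        # Stand-in for a single gather instruction: base + per-lane offset.
        out.extend(memory[base + off] for off in lanes)
    return out


if __name__ == "__main__":
    memory = [float(i) for i in range(64)]
    base, offsets = 10, [0, 3, 6, 9, 12]
    reference = gather_scalar_loop(memory, [base + o for o in offsets])
    # The result must not depend on the vector width chosen at run time.
    for vl in (1, 2, 4, 8):
        assert gather_base_offset(memory, base, offsets, vl) == reference
```

The point of the sketch is only that the two forms compute the same result for any vector width; on real hardware, each chunk in the second form would correspond to one predicated gather rather than a Python loop.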