Protocol and Software Co-optimization: Overcoming the PCIe Transfer Bottleneck in dGPU Accelerated In-kernel ML
Wen Li, Song Zhu, Panpan Cuan, Zheng Zhang, Jie Zhao, Qinglei Qi

In-kernel machine learning (in-kernel ML) enables low-overhead online decision-making for OS subsystems such as scheduling, storage, and security. However, in the current in-kernel-to-dGPU offloading path, inference data must traverse kernel space, user space, and device memory. In the small-batch regime, the dominant costs are therefore the software path and PCIe transfers rather than any shortage of dGPU compute capability, and these costs often erase the acceleration gains. To address this issue, this paper proposes a hierarchical co-optimization scheme spanning data reduction, lightweight compression, and protocol optimization. It combines feature selection, mixed-precision quantization, and lightweight compression to reduce the transfer load, and uses a state-aware dynamic scheduler to adapt the transfer strategy to changes in PCIe bandwidth, CPU load, and data sparsity. On the dGPU, decompression and dequantization are pipelined with computation to increase overlap. Across multiple in-kernel workloads, our scheme reduces transferred bytes by 50%–87.5%, lowers the dGPU break-even batch size from 256 to 64, reduces end-to-end inference latency by 3.2×–5.8×, and keeps the accuracy loss within 3%.
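To make the byte-reduction figures concrete, the following minimal sketch shows how narrowing each transferred value shrinks the payload. It rests on assumptions the abstract does not confirm: a 32-bit floating-point baseline, symmetric per-tensor int8 quantization as the example scheme, and hypothetical helper names (`reduction`, `quantize_int8`, `dequantize_int8`). Under that assumed baseline, 16-bit values would give a 50% reduction and 4-bit values 87.5%, which happens to match the reported range; the paper's actual mixed-precision policy may differ.

```python
import struct


def reduction(src_bits: int, dst_bits: int) -> float:
    """Fraction of payload bytes saved when each value shrinks from src_bits to dst_bits."""
    return 1.0 - dst_bits / src_bits


def quantize_int8(values):
    """Symmetric per-tensor int8 quantization (illustrative, not the paper's scheme).

    Returns the packed payload (one byte per value) and the scale needed to restore it.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return struct.pack(f"<{len(quantized)}b", *quantized), scale


def dequantize_int8(payload: bytes, scale: float):
    """Inverse step, which the receiver would run before inference."""
    return [q * scale for q in struct.unpack(f"<{len(payload)}b", payload)]


# Hypothetical feature vector standing in for one in-kernel inference input.
features = [0.5, -1.25, 3.0, 0.0, -2.75, 1.0, 0.125, -0.5]

fp32_payload = struct.pack(f"<{len(features)}f", *features)  # assumed baseline format
int8_payload, scale = quantize_int8(features)
restored = dequantize_int8(int8_payload, scale)

# Worst-case rounding error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(features, restored))

print(len(fp32_payload), len(int8_payload), reduction(32, 8))
```

The int8 case sits between the two ends of the reported range at 75%. The rounding error it introduces is bounded by half the scale, which is the kind of loss a mixed-precision scheme has to weigh against the accuracy budget.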