DOI: 10.1145/3827621 ISSN: 1544-3566

Parallax: Performance Prediction for Training–Inference Co-Execution

Zesen Hu, Desheng Wang, Sichao Chen, Qiang Wang, Shuo Si, Weizhe Zhang

GPU clusters for deep learning (DL) workloads, especially inference, generally suffer from low utilization because resources are over-provisioned to satisfy strict latency requirements. Co-locating throughput-oriented training jobs with latency-sensitive inference services is a promising way to reclaim idle resources. However, accurately predicting the performance interference between such heterogeneous workloads under the two mainstream GPU sharing mechanisms, Time-Slicing and Multi-Process Service (MPS), remains a critical challenge. Existing predictors either focus on single-tenant scenarios or lack the fidelity to capture the complex contention patterns between training and inference. In this paper, we present Parallax, a fine-grained and accurate performance prediction framework tailored for training–inference co-execution. Parallax introduces an interpretable modeling strategy for each of the two sharing mechanisms. For Time-Slicing, we propose a simulation-based model that uses virtual time slice and switching overhead abstractions to reconstruct operator-level interleaving and context-switching costs. For MPS, we develop a two-stage framework that first predicts resource utilization under concurrency and then quantifies the performance degradation caused by microarchitectural contention, such as contention for memory bandwidth and cache. This resource-centric approach ensures robust generalization to unseen workload combinations. Extensive evaluations on modern GPUs show that Parallax predicts execution latency with a mean absolute percentage error (MAPE) of 4.33% for Time-Slicing and 6.12% for MPS across diverse DL models (averaged over training and inference). Parallax is available at https://github.com/HIT-CeeCG/Parallax.
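
To make the Time-Slicing idea and the MAPE metric concrete, the following is a minimal toy sketch in Python. It is not Parallax's simulator: the abstract does not specify the model's details, so the round-robin policy, the fixed slice length, the fixed per-switch cost, and the names `simulate_time_slicing` and `mape` are all assumptions made for illustration only.

```python
def simulate_time_slicing(work_ms, slice_ms, switch_ms):
    """Toy round-robin simulation of two or more jobs sharing one GPU.

    Assumptions (hypothetical, not taken from Parallax):
      - work_ms maps a job name to its total solo GPU time in milliseconds;
      - the GPU runs each job for at most slice_ms before moving on;
      - switching between two different jobs costs a fixed switch_ms.
    Returns a dict mapping each job name to its completion time in ms.
    """
    remaining = dict(work_ms)   # GPU time each job still needs
    order = list(work_ms)       # fixed round-robin order
    finish = {}
    clock = 0.0
    last = None                 # job that ran in the previous slice
    while remaining:
        for name in order:
            if name not in remaining:
                continue
            if last is not None and last != name:
                clock += switch_ms          # pay the context-switch cost
            run = min(slice_ms, remaining[name])
            clock += run
            remaining[name] -= run
            last = name
            if remaining[name] <= 1e-12:    # job finished in this slice
                finish[name] = clock
                del remaining[name]
    return finish


def mape(predicted, actual):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return 100.0 * sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical solo latencies; not measurements from the paper.
    times = simulate_time_slicing(
        {"inference": 4.0, "training": 10.0}, slice_ms=2.0, switch_ms=0.5
    )
    print(times)
```

In this toy setting, co-execution stretches the inference job from 4.0 ms of solo GPU time to a 7.0 ms completion time, because it waits for one training slice and pays for two switches. A real predictor would need measured operator traces and calibrated slice and overhead values rather than the constants assumed here.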
