Parallax: Performance Prediction for Training–Inference Co-Execution
Zesen Hu, Desheng Wang, Sichao Chen, Qiang Wang, Shuo Si, Weizhe Zhang

GPU clusters for deep learning (DL) workloads, especially inference, generally suffer from low utilization because resources are over-provisioned to satisfy strict latency requirements. Co-locating throughput-oriented training jobs with latency-sensitive inference services is a promising approach to reclaim idle resources. However, accurately predicting the performance interference of such heterogeneous workloads under the two mainstream GPU sharing mechanisms, Time-Slicing and Multi-Process Service (MPS), remains a critical challenge. Existing predictors either focus on single-tenant scenarios or lack the fidelity to capture the complex contention patterns between training and inference. In this paper, we present Parallax, a fine-grained and accurate performance prediction framework tailored for training–inference co-execution. Parallax introduces an interpretable modeling strategy for each of the two sharing mechanisms. For Time-Slicing, we propose a simulation-based model that uses virtual time slice and switching overhead abstractions to reconstruct operator-level interleaving and context-switching costs. For MPS, we develop a two-stage framework that first predicts resource utilization under concurrency and then quantifies the performance degradation caused by microarchitectural contention, such as contention for memory bandwidth and cache. This resource-centric approach is designed to generalize robustly to unseen workload combinations. Extensive evaluations on modern GPUs demonstrate the high accuracy of Parallax, which predicts execution latency with a mean absolute percentage error (MAPE) of 4.33% for Time-Slicing and 6.12% for MPS across diverse DL models (averaged over training and inference). Parallax is available at https://github.com/HIT-CeeCG/Parallax.
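
To make the Time-Slicing abstraction and the MAPE metric concrete, the following is a minimal toy sketch and not the Parallax implementation. It assumes a simple round-robin scheduler, non-preemptible operators (an operator that starts inside a slice runs to completion), and a fixed switching overhead charged whenever the GPU changes owner. The function names `simulate_time_slicing` and `mape`, the operator durations, and the parameter values are all hypothetical and chosen only for illustration.

```python
from collections import deque


def simulate_time_slicing(op_durations, time_slice, switch_overhead):
    """Toy round-robin model of GPU time-slicing (illustrative assumption only).

    op_durations: dict mapping a job name to its list of operator durations (ms).
    Returns a dict mapping each job name to its completion time (ms).
    """
    queues = {name: deque(ops) for name, ops in op_durations.items()}
    order = deque(name for name, q in queues.items() if q)
    clock = 0.0
    finish = {}
    while order:
        name = order.popleft()
        queue = queues[name]
        budget = time_slice
        # Launch operators while slice budget remains; an operator that has
        # started is assumed to run to completion, even past the slice end.
        while queue and budget > 0:
            duration = queue.popleft()
            clock += duration
            budget -= duration
        if queue:
            order.append(name)      # job still has work, requeue it
        else:
            finish[name] = clock    # job finished during this slice
        # Charge the switching overhead only if a different job runs next.
        if order and order[0] != name:
            clock += switch_overhead
    return finish


def mape(predicted, measured):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical workload: three 4 ms training operators, two 1 ms inference operators.
finish = simulate_time_slicing(
    {"train": [4.0, 4.0, 4.0], "infer": [1.0, 1.0]},
    time_slice=2.0,
    switch_overhead=0.5,
)
print(finish)
```

In this toy setting the inference job, which needs 2 ms when run alone, completes at 6.5 ms because it waits behind a training operator and a context switch. This is the kind of operator-level interleaving effect that a simulation-based predictor has to reconstruct. Comparing such predicted latencies against measured ones with `mape` yields an error percentage of the kind quoted above.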