DOI: 10.1145/3822179 ISSN: 1544-3566

BridgedRing: A Cost-Effective Hardware-Software Co-Design to Overcome the UPI Bottleneck in GPU Servers

Wanfeng Huang, Yihao Zhao, Yakun Zhang, Jianghong Ma, Yunming Ye

Distributed training on standard dual-socket GPU servers suffers from the “UPI Wall”, a bottleneck stemming from the interplay between the low physical bandwidth of the Ultra Path Interconnect (UPI) and the severe bidirectional contention induced by NCCL’s ring-based collectives. Software-only optimizations cannot overcome this hardware constraint. We present BridgedRing, a hardware-software co-design that bypasses the UPI Wall using an off-the-shelf NVLink Bridge costing approximately $200, less than 0.5% of the total cost of a typical 8-GPU server. Our topology-aware algorithm redirects all cross-NUMA traffic through this high-bandwidth bypass while preserving intra-NUMA communication over PCIe. Evaluated on 8× A6000 servers against NCCL and hierarchical algorithms, BridgedRing achieves up to 1.97× higher collective bandwidth and delivers 1.31× end-to-end throughput for GPT-13B training in Megatron-LM (TP=8, SP=8), 1.36× for GPT-6.7B, and 1.40× for Qwen3-1.7B in PyTorch DDP. Crucially, these gains require no framework modifications, offering immediate, cost-effective acceleration for the vast installed base of standard GPU servers.
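To make the bidirectional contention concrete, the following minimal sketch counts the hops of a plain ring that cross the socket boundary. It is an illustration only, not the BridgedRing algorithm: the GPU-to-NUMA layout (GPUs 0-3 on node 0, GPUs 4-7 on node 1), the ring order, and the names `NUMA_OF_GPU` and `cross_numa_hops` are all assumptions made for this example.

```python
# Hypothetical layout (an assumption, not taken from the paper):
# GPUs 0-3 sit on NUMA node 0, GPUs 4-7 on NUMA node 1.
NUMA_OF_GPU = {gpu: 0 if gpu < 4 else 1 for gpu in range(8)}


def cross_numa_hops(ring):
    """Return the (src, dst) hops of a ring that cross between NUMA nodes."""
    hops = []
    for i, src in enumerate(ring):
        dst = ring[(i + 1) % len(ring)]  # the ring wraps back to its first GPU
        if NUMA_OF_GPU[src] != NUMA_OF_GPU[dst]:
            hops.append((src, dst))
    return hops


# A plain ring over all eight GPUs in index order.
ring = list(range(8))
crossing = cross_numa_hops(ring)

# Which socket-to-socket directions the crossing hops use.
directions = {(NUMA_OF_GPU[s], NUMA_OF_GPU[d]) for s, d in crossing}

print(crossing)    # hops that leave one NUMA node for the other
print(directions)  # (source node, destination node) pairs in use
```

Because a ring is a closed loop, any ring that spans both sockets has to leave one NUMA node and return to it, so it carries traffic across the inter-socket link in both directions at once. Under the layout assumed here, those are the hops 3→4 and 7→0.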
