Scalable and Energy-Efficient AI: System-Level Profiling of NVIDIA GPU Clusters for Distributed LLM Training
Muhammad Ali Shafique, Imran Latif, Hayat Ullah, Alex C. Newkirk, Arslan Munir
The rapid scaling of large language model (LLM) training has intensified demand for Graphics Processing Unit (GPU) clusters that balance throughput with energy efficiency. While NVIDIA’s H100 and B200 architectures are increasingly deployed in production datacenters, their comparative behavior under distributed training remains insufficiently characterized beyond vendor specifications, leaving datacenter operators without empirical guidance on metrics such as TFLOPs/kW and tokens-per-kilojoule. This work presents a system-level evaluation of single-node 8× H100 and 8× B200 configurations using Distributed Data Parallel (DDP) training across LLMs and vision–language models (VLMs) ranging from 7B to 32B parameters, covering a range of realistic AI workload scenarios. We benchmark end-to-end throughput, utilization, power, energy, TFLOPs/kW, and tokens-per-kilojoule, and complement these measurements with an architectural analysis that explains the observed behavioral differences. Across LLM workloads, B200 achieves higher utilization (1–6%), faster training (up to 15%), and greater compute efficiency (up to 32% higher TFLOPs/GPU), attributable to its higher memory bandwidth and larger streaming multiprocessor (SM) count. However, B200 exhibits lower TFLOPs/kW and tokens-per-kilojoule, revealing a fundamental trade-off: throughput gains come at a measurable energy cost per useful token. VLM results further expose model-dependent asymmetries, with B200 consuming disproportionately more energy for lighter compute kernels due to its elevated baseline power draw. These findings provide an empirical framework that distinguishes compute efficiency from energy efficiency across next-generation GPU nodes, offering practical guidance for energy-aware AI datacenter design.
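To make the two energy metrics concrete, the following minimal Python sketch shows one plausible way to derive them from a logged training run. The abstract does not give formal definitions, so the formulas here are assumptions: energy is taken as the trapezoidal integral of sampled node power, tokens-per-kilojoule as tokens processed divided by that energy, and TFLOPs/kW as mean sustained TFLOP/s divided by mean power in kilowatts. The names `RunMeasurement`, `energy_joules`, and `efficiency_metrics` are hypothetical, and the numbers in the usage example are illustrative only, not results from this work.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class RunMeasurement:
    """Hypothetical record of one training run on a single node."""
    tokens_processed: int          # total tokens consumed by training
    achieved_tflops: float         # mean sustained node throughput, TFLOP/s
    timestamps_s: Sequence[float]  # power sample times, in seconds
    power_w: Sequence[float]       # node power at each sample, in watts


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power over time with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w) or len(timestamps_s) < 2:
        raise ValueError("need at least two matching time/power samples")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def efficiency_metrics(run: RunMeasurement) -> Dict[str, float]:
    """Compute the assumed energy-efficiency metrics for one run."""
    energy_j = energy_joules(run.timestamps_s, run.power_w)
    duration_s = run.timestamps_s[-1] - run.timestamps_s[0]
    mean_power_kw = energy_j / duration_s / 1000.0
    energy_kj = energy_j / 1000.0
    return {
        "energy_kj": energy_kj,
        "mean_power_kw": mean_power_kw,
        "tokens_per_kj": run.tokens_processed / energy_kj,
        "tflops_per_kw": run.achieved_tflops / mean_power_kw,
    }


# Illustrative values only (not measurements from the paper):
# a constant 5 kW draw for 10 s while processing 100,000 tokens.
example = RunMeasurement(
    tokens_processed=100_000,
    achieved_tflops=2500.0,
    timestamps_s=[0.0, 5.0, 10.0],
    power_w=[5000.0, 5000.0, 5000.0],
)
metrics = efficiency_metrics(example)
print(metrics)
```

Under these assumed definitions, a node can finish a run sooner and still score lower on tokens-per-kilojoule if its mean power rises faster than its throughput, which is the kind of trade-off the abstract reports for B200.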