DOI: 10.1145/3820156 ISSN: 1553-3077

A Six-Year Journey of Data Collecting, Reviewing, and Reporting for the TaihuLight Supercomputer

Bin Yang, Hao Wei, Wenhao Zhu, Yuhao Zhang, Weiguo Liu, Wei Xue

The system architecture of contemporary supercomputers is growing increasingly intricate with the ongoing evolution of system-wide network and storage technologies, making it challenging for application developers and system administrators to manage and utilize the escalating complexity of supercomputers effectively. Moreover, the limited experience of application developers and system administrators in conducting insightful analyses of diverse High-Performance Computing (HPC) workloads and the resulting array of resource utilization characteristics exacerbate the challenge. To address this issue, we undertake a comprehensive analysis of six years’ worth of 40 TB data (comprising I/O performance data and job running information) from Sunway TaihuLight, with 41,508 nodes, currently ranked as the world’s 13th-fastest supercomputer. Our study provides valuable insights into operational management strategies for HPC systems (i.e., job hanging caused by heavy-load benchmark testing, job starvation caused by aggressive scheduling policies) and I/O workload characteristics (i.e., getattr operations spiking caused by massive access to grid files, and a large number of files accessed by many applications in a short period), shedding light on both challenges and opportunities for improvements in the HPC environment. This paper delineates our methodology, findings, and the significance of this study. Additionally, we discuss the potential of our research for future studies and practice within this domain.

More from our Archive