Reliability Analysis and Optimization of a Multi‐Server Retrial Repairable System
Yihao Zhao, Jinbiao Wu
ABSTRACT
The rapid development of complex systems such as high‐performance computing (HPC) clusters highlights the need for infrastructure that balances performance, reliability, and cost. Hardware malfunctions and undetected software failures, however, pose a major threat to operational stability. To address these challenges, this paper presents a comprehensive performance and cost‐benefit analysis based on a multi‐server machine repair model. The model incorporates several features drawn from real service data: multiple operating machines with warm standbys, imperfect coverage that may require a system reboot, and a retrial mechanism for failed machines that find the service facility fully occupied. We model the system as a three‐dimensional continuous‐time Markov chain (CTMC) and derive its steady‐state probability distribution. From this distribution, several key performance indicators are computed, and a comprehensive cost function is then constructed to evaluate the economic feasibility and operational efficiency of the system. An optimization technique is applied to determine the number of repairmen, the number of warm standby units, the repair rate, and the reboot rate that minimize the total cost. Numerical experiments substantiate the effectiveness of the proposed model in identifying an optimal operational strategy, and the results offer practical insights to support data‐driven managerial decisions on resource allocation and maintenance policies.
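As a rough illustration of the workflow summarized above (steady-state distribution, performance indicator, cost function, optimization), the following Python sketch uses a deliberately simplified one-dimensional machine repair model. It is not the paper's three-dimensional CTMC: retrials, imperfect coverage, and rebooting are omitted, and every name, rate, and cost coefficient (`lam`, `alpha`, `mu`, `c_short`, `c_repairman`, `c_standby`) is a hypothetical placeholder chosen for illustration.

```python
# Simplified sketch (assumption): a birth-death machine repair model with
# M operating machines, S warm standbys and R repairmen. The state n is the
# number of failed machines. Retrial, imperfect coverage and reboot are NOT modeled.

def failure_rate(n, M, S, lam, alpha):
    """Total failure rate when n machines are down.

    lam   = failure rate of an operating machine (hypothetical value)
    alpha = failure rate of a warm standby (hypothetical value)
    """
    if n <= S:
        # All M positions are filled; S - n standbys remain.
        return M * lam + (S - n) * alpha
    # Standbys exhausted; only M + S - n machines still operate.
    return (M + S - n) * lam


def steady_state(M, S, R, lam, alpha, mu):
    """Stationary distribution over n = 0..M+S via the birth-death product form."""
    weights = [1.0]
    for n in range(M + S):
        birth = failure_rate(n, M, S, lam, alpha)
        death = min(n + 1, R) * mu  # repair rate in state n + 1
        weights.append(weights[-1] * birth / death)
    total = sum(weights)
    return [w / total for w in weights]


def expected_cost(M, S, R, lam, alpha, mu,
                  c_short=100.0, c_repairman=20.0, c_standby=8.0):
    """Hypothetical cost: shortage of operating machines plus staffing and standby costs."""
    pi = steady_state(M, S, R, lam, alpha, mu)
    mean_shortage = sum(max(0, n - S) * p for n, p in enumerate(pi))
    return c_short * mean_shortage + c_repairman * R + c_standby * S


def best_design(M, lam, alpha, mu, max_R=5, max_S=5):
    """Exhaustive search over (R, S); returns (cost, R, S) with the lowest cost."""
    candidates = (
        (expected_cost(M, S, R, lam, alpha, mu), R, S)
        for R in range(1, max_R + 1)
        for S in range(0, max_S + 1)
    )
    return min(candidates)


if __name__ == "__main__":
    cost, R, S = best_design(M=4, lam=0.1, alpha=0.05, mu=1.0)
    print(f"lowest cost {cost:.3f} at R={R}, S={S}")
```

The exhaustive search only covers the two discrete decision variables; the continuous variables mentioned above (repair rate and reboot rate) would need a separate continuous optimization step, which this sketch does not attempt.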