Towards the Localization of Multi-Root-Cause Failures in Microservice Systems: An Active Intervention Framework
Yazhuo Gao, Lin Yang, Lianxiao Meng, Ran Zhu, Yining Cao

In large-scale microservice systems, multi-root-cause failures often intertwine, significantly increasing overall system risk and triggering a deluge of cascading alerts that pose serious challenges to fault diagnosis and recovery. Existing root-cause localization techniques remain largely passive, relying on rule-based pattern recognition or graph-based propagation inference, and thus falter when faced with the complexity of multi-root-cause failures. To address these challenges, this paper introduces a novel active-intervention-based framework for root-cause localization. The framework uses Hierarchical Reinforcement Learning (HRL) to infer root causes and employs an Intervention-enhanced Graph ATtention network (IGAT) to predict the fault scenarios each cause may trigger. By iteratively comparing these predicted scenarios against the system’s real-time state, the framework dynamically refines its localization model. Experimental results on two public datasets and a constructed dataset show that our method outperforms the second-best method by at least 22% on the PR@1 metric in single-root-cause scenarios and leads by 51.7% on the RE@3 metric in multi-root-cause scenarios. These results suggest that active intervention offers clear advantages for root-cause analysis, particularly when failures have multiple root causes.