LLM-Guided Graph Structure Learning for Alert Convergence in AIOps

DOI: 10.3390/computers15070412 ISSN: 2073-431X

Haodong Zou, Yichen Zhao, Xin Chen, Ling Wang, Jinghang Yu, Long Yuan, Luokai Jiang

In modern cloud-native systems, a single root cause can trigger cascading anomalies across multiple entities (e.g., microservices, databases, and hosts), generating alert storms with hundreds or thousands of heterogeneous alerts. Alert convergence (automatically grouping these alerts into actionable incident tickets) is critical for reducing operator burden and recovery time. Existing graph-based methods construct a topological graph from known entity dependencies and then leverage Graph Neural Networks (GNNs) for information propagation, but they rely on static physical topologies that fail to capture implicit fault propagation paths. Large Language Model (LLM)-based methods focus on reasoning about the textual information of alerts, yet they do not incorporate global topological structure and struggle with consistency at scale. Motivated by these limitations, we propose LLM-Guided Graph Structure Learning (LLM-GSL), a novel framework that combines the semantic reasoning ability of LLMs with the structural modeling power of GNNs for alert convergence. Specifically, LLM-GSL first leverages an LLM to evaluate pairwise entity relationships and discover implicit fault propagation paths that are absent from static topologies, thereby enhancing the physical-topology graph into a more complete structure. A Graph Attention Network (GAT) then refines alert representations over this enhanced graph via graph message passing, guided by a self-supervised graph affinity loss with continuous multi-modal supervision targets that fuse adjacency structure, textual affinity, and temporal affinity. Finally, density-based clustering groups the learned representations into incident tickets. 
Experiments on five public datasets (four LogHub-derived datasets and one RCAEval microservice fault-injection subset) demonstrate that LLM-GSL achieves an average F1-score of 96.2%, outperforming six baselines, including both traditional clustering and LLM-based methods, by at least 14.0 percentage points.
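
To make the supervision and grouping steps more concrete, the following is a minimal sketch, not the authors' implementation. It only illustrates how adjacency, textual affinity, and temporal affinity might be fused into one continuous pairwise target, and how a density-based pass could then group alerts. The function names, fusion weights, exponential time decay, `tau`, `eps`, and `min_pts` are all assumptions made for illustration. The sketch also skips the LLM edge discovery and GAT refinement entirely, and clusters on the fused affinity directly rather than on learned representations.

```python
import math

def fuse_affinity(adjacency, text_sim, timestamps, weights=(0.4, 0.4, 0.2), tau=60.0):
    """Blend three pairwise signals into one continuous target in [0, 1].

    The weights and the exponential time decay are illustrative assumptions.
    """
    w_adj, w_text, w_time = weights
    n = len(adjacency)
    target = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Alerts raised close together in time get temporal affinity near 1.
            temporal = math.exp(-abs(timestamps[i] - timestamps[j]) / tau)
            target[i][j] = (w_adj * adjacency[i][j]
                            + w_text * text_sim[i][j]
                            + w_time * temporal)
    return target

def density_cluster(dist, eps, min_pts):
    """Plain DBSCAN over a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbors) < min_pts:
            labels[p] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # noise reachable from a core point is a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)  # q is a core point, keep expanding
    return labels

# Toy data (hypothetical): alerts 0-2 belong to one fault, alerts 3-4 to another.
adjacency = [[1, 1, 0, 0, 0],
             [1, 1, 1, 0, 0],
             [0, 1, 1, 0, 0],
             [0, 0, 0, 1, 1],
             [0, 0, 0, 1, 1]]
text_sim = [[1.0, 0.8, 0.7, 0.1, 0.1],
            [0.8, 1.0, 0.9, 0.2, 0.1],
            [0.7, 0.9, 1.0, 0.1, 0.2],
            [0.1, 0.2, 0.1, 1.0, 0.8],
            [0.1, 0.1, 0.2, 0.8, 1.0]]
timestamps = [0, 10, 20, 600, 620]  # seconds

target = fuse_affinity(adjacency, text_sim, timestamps)
distance = [[1.0 - a for a in row] for row in target]
labels = density_cluster(distance, eps=0.3, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1]
```

In this toy run, alerts 0 and 2 are not directly similar enough to fall within `eps` of each other, but both are close to alert 1, so the density-based expansion still places all three in the same group. This chaining behaviour is one reason a density-based method can suit cascading faults, although the paper's actual clustering configuration may differ.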
