DOI: 10.1145/3808139 ISSN: 2994-970X
Hallucinations in LLM-Based Code Summarization: Unveiling, Detection, and Mitigation
Guanghua Wan, Yuanning Feng, Yao Wan, Zhaoyang Chu, Zhangqian Bi, Junxiao Han, Zhou Zhao, Hongyu Zhang, Pingpeng Yuan, Xuanhua Shi, Hai Jin
Code summarization plays a vital role in program comprehension and software maintenance by generating natural language descriptions to summarize the semantics of code. While Large Language Models (LLMs) have shown remarkable performance in this area, recent empirical studies reveal a critical limitation: LLMs are prone to hallucinations, producing summaries that are factually inaccurate or unfaithful to the source code, potentially misleading developers.
In this paper, we unveil, detect, and mitigate hallucinations in LLM-based code summarization. First, we construct Hallu-Eval, a novel dataset for unveiling hallucination phenomena and rigorously evaluating the effectiveness of hallucination detection and mitigation in LLM-based code summarization. It comprises both original code snippets, which capture naturally occurring hallucinations, and their semantically perturbed counterparts, which are designed to systematically induce challenging logical hallucinations. Both are complemented with manual hallucination annotations on a curated testbed of 800 code-summary pairs. Next, we propose Hallu-Det, a synergistic approach that combines direct entity-level detection, which identifies explicit hallucinations, with a synonymous mutation-based refinement that reliably confirms or refutes more ambiguous cases. Finally, we introduce Hallu-Shield, an inference-time mitigation approach that leverages an external value model to guide LLMs toward producing more faithful summaries without costly retraining of the LLM itself. Extensive experiments show that Hallu-Eval effectively triggers hallucinations, increasing the hallucination rate of models such as Qwen2.5-Coder-7B from 17% to 97% on perturbed code. Our detection approach, Hallu-Det, outperforms the baselines, reaching an F1-score of 0.95 for summaries generated by Qwen2.5-Coder-7B. Moreover, our mitigation method, Hallu-Shield, reduces hallucination rates: on DeepSeek-Coder-6.7B, for example, it lowers the rate from 66% to 59%, a 10.6% relative reduction, while simultaneously improving summary quality, achieving a 74.0% win rate as evaluated by a majority vote of an LLM-as-a-judge ensemble.
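The relative reduction quoted above can be checked with a short calculation. The sketch below is illustrative only: the function names `absolute_reduction` and `relative_reduction` are hypothetical and not taken from the paper, and the inputs are the rounded percentages reported in the abstract, so the result is only as precise as those figures.

```python
def absolute_reduction(before: float, after: float) -> float:
    """Difference between two rates, in percentage points."""
    return before - after


def relative_reduction(before: float, after: float) -> float:
    """Reduction expressed as a fraction of the original rate."""
    if before <= 0:
        raise ValueError("the original rate must be positive")
    return (before - after) / before


# Rounded hallucination rates for DeepSeek-Coder-6.7B as stated in the abstract.
rate_before = 66.0  # percent, without mitigation
rate_after = 59.0   # percent, with Hallu-Shield

points = absolute_reduction(rate_before, rate_after)
relative = relative_reduction(rate_before, rate_after)

print(f"absolute reduction: {points:.1f} percentage points")
print(f"relative reduction: {relative * 100:.1f}%")
```

The two measures answer different questions: the absolute figure is the drop in percentage points, while the relative figure scales that drop by the starting rate, which is why a 7-point drop from 66% is reported as a 10.6% relative reduction.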