DOI: 10.1145/3821416 ISSN: 1049-331X
Assessing and Improving Prompting Large Language Models for Software Vulnerability Analysis
Yu Nong, Guangbei Yi, Mohammed Aldeen, Long Cheng, Hongxin Hu, Haipeng Cai
Large language models (LLMs) have demonstrated potential in diverse domains including software analysis. Yet there is a lack of systematic assessment of how LLMs perform in comparison to various extant approaches, and how LLMs may be improved, for software vulnerability analysis, via prompt engineering. In this paper, we present a comprehensive, large-scale empirical study of ten LLMs with seven prompting strategies versus nine traditional (five code-analysis- and four deep-learning-based) techniques on three vulnerability analysis tasks (detection, classification, and repair) against five real-world datasets (8,000+ C/C++ samples, including a zero-day dataset). We show that, with existing prompting strategies, LLMs often struggle with practical vulnerability analysis and underperform the traditional approaches. Via in-depth case analysis, we reveal that the evaluated LLMs frequently suffer from incorrect reasoning. Based on these findings, we improve the prompting with a vulnerability-specific adaptation of chain-of-thought (CoT), named Vulnerability-Semantics-guided Prompting (VSP). Our results show that VSP improves the performance of some of the LLMs in certain configurations across the three tasks. VSP also helps mitigate the reasoning limitations for some of the evaluated LLMs. For vulnerability detection on unseen data, improvements are limited or marginal for some models. We further identify seven common challenges that led to the LLMs’ incorrect answers in these tasks and provide actionable recommendations to help mitigate them.