Evaluating Large Language Models for Automated Code Vulnerability Detection
Achraf Ghorbel, Osama Hosam, Nassim Tinkicht, Ghazi Ben Ayed

Seven large language models were compared on binary vulnerability classification using the DiverseVul benchmark: StarCoder2-7B, Phi-3.5-Mini-Instruct, DeepSeek-Coder-6.7B-Instruct, Llama3-8B-Instruct, Gemma-7B-IT, Qwen2.5-7B-Instruct and GPT-4o-mini. Each model was evaluated under a unified zero-shot prompting protocol with generation settings chosen to minimise stochasticity: API-accessible models were run with temperature = 0 or the lowest value allowed by the platform, while locally executed models used greedy decoding. On a balanced test set of 3,000 samples, accuracy, precision, recall, F1-score and Cohen's Kappa were calculated. Accuracy ranged from 39.93% to 56.23%. The highest accuracy (56.23%) and the highest Kappa (0.1246) were obtained with GPT-4o-mini. Under standard interpretation guidelines, a Kappa of 0.1246 indicates only slight agreement, suggesting that performance is weaker than the accuracy figures alone imply. The code-specialised models were more prone to over-flag vulnerabilities and accumulate false positives, whereas instruction-tuned general-purpose models were less prone to do so. Confusion-matrix profiles and inference-time measurements are also provided. No new detection method is introduced; the paper instead presents a controlled zero-shot evaluation of these models, without fine-tuning or task-specific adaptation.
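
To make the relationship between accuracy and Cohen's Kappa concrete, the following minimal sketch computes the reported metric set from the four cells of a binary confusion matrix. The function name `binary_metrics` and the cell counts are hypothetical and purely illustrative; they are not results from the paper. On an exactly balanced binary test set, the chance-agreement term equals 0.5 whatever the prediction distribution, so Kappa reduces to 2 × accuracy − 1. This is consistent with the reported pair, since 2 × 0.5623 − 1 = 0.1246.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1 and Cohen's Kappa
    from binary confusion-matrix counts (positive = vulnerable)."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Expected agreement by chance: product of the marginal
    # proportions, summed over both classes.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Illustrative counts only (NOT taken from the paper): a balanced set
# with 1,500 vulnerable and 1,500 non-vulnerable samples.
example = binary_metrics(tp=900, fp=700, fn=600, tn=800)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Because the chance term is fixed at 0.5 on a balanced set, any accuracy below 50% corresponds to a negative Kappa, that is, agreement worse than chance.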