Safety-concordant performance of large language models on cardiology vignettes
I Zeljkovic, Q Zhu, A Jordan, A Lisicic, S Sokol Tomic, N Pavlovic, S ManolaAbstract
Background
As LLMs move from demos to decision support, cardiology needs evaluation beyond item-level accuracy: specifically (i) avoidance of guideline-discordant or potentially harmful actions, (ii) stability of recommendations across repeated runs, and (iii) fidelity of rationales to contemporary guidance. We re-analysed a 199-vignette benchmark spanning basic care to board-level scenarios to prioritise these safety-centric endpoints.
Methods
Five text LLMs (Bard/PaLM-2, GPT-3.5 Turbo, GPT-4.0, GPT-4o, GPT-o1) answered multiple-choice vignettes (Sets A–D; increasing difficulty). Image-only items were excluded; ECG/imaging findings were described textually. Each vignette was asked in a fresh session with answer-order randomisation; five independent re-runs per item were also collected. A senior cardiology panel pre-specified a library of sentinel unsafe actions. Primary endpoint was safety-concordant accuracy (SCA): the proportion of items answered correctly when at least one alternative option was adjudicated unsafe. Secondary endpoints: (1) harm-avoidance rate (HAR), i.e., the fraction of items where the model did not select an unsafe option regardless of correctness; (2) flip-rate, i.e., the proportion of items where the model changed answers across re-runs (stability); (3) guideline-fidelity index from panel review of stratified explanations. Wilcoxon/Friedman tests compared models within difficulty tiers; Wilson CIs summarised proportions. Analyses are hypothesis-generating.
Results
Safety-weighted endpoints reshaped model separation at higher difficulty. Models with the highest overall accuracy (GPT-4.0, GPT-4o) also delivered the highest SCA and HAR, particularly on Sets C–D where unsafe distractors were frequent. Mid-tier performance (GPT-o1) narrowed the gap to GPT-4o on SCA despite lower raw accuracy, driven by better avoidance of sentinel unsafe options. Lower-performing models (GPT-3.5, Bard) showed occasional breaches on sentinel scenarios. Stability differed materially: GPT-4.0/4o showed the lowest flip-rates across re-runs; Bard and GPT-3.5 exhibited higher flip-rates, with inconsistent justifications on repeat queries. Majority-vote "consensus" modestly improved correctness on difficult items but did not fully eliminate unsafe picks. Panel-scored guideline fidelity in explanations tracked model rank at board-level difficulty: higher models more often articulated class/level and stepwise therapy, whereas lower models mixed intensity classes or skipped prerequisite steps.
Conclusions
In cardiology vignettes, safety-concordant accuracy, harm-avoidance, and stability provide a stricter, more clinically meaningful lens than accuracy alone. GPT-4-class models led across this safety composite, but non-zero unsafe selections persist on sentinel scenarios, underscoring the need for guardrails. Future work should integrate real-time guideline snippets and calibrate abstention when class/level is uncertain.