Safety-concordant performance of large language models on cardiology vignettes

doi:10.1093/europace/euag105.1229

ISSN: 1099-5129

I Zeljkovic, Q Zhu, A Jordan, A Lisicic, S Sokol Tomic, N Pavlovic, S Manola

Abstract

Background

As large language models (LLMs) move from demonstrations to decision support, cardiology needs evaluation that goes beyond item-level accuracy, specifically: (i) avoidance of guideline-discordant or potentially harmful actions, (ii) stability of recommendations across repeated runs, and (iii) fidelity of rationales to contemporary guidance. We re-analysed a 199-vignette benchmark, spanning basic care to board-level scenarios, to prioritise these safety-centric endpoints.

Methods

Five text LLMs (Bard/PaLM-2, GPT-3.5 Turbo, GPT-4.0, GPT-4o, GPT-o1) answered multiple-choice vignettes grouped into Sets A–D of increasing difficulty. Image-only items were excluded; ECG and imaging findings were described textually. Each vignette was posed in a fresh session with randomised answer order, and five independent re-runs per item were also collected. A senior cardiology panel pre-specified a library of sentinel unsafe actions. The primary endpoint was safety-concordant accuracy (SCA): among items in which at least one alternative option was adjudicated unsafe, the proportion answered correctly. Secondary endpoints were: (1) harm-avoidance rate (HAR), i.e., the fraction of items in which the model did not select an unsafe option, regardless of correctness; (2) flip-rate, i.e., the proportion of items in which the model changed its answer across re-runs (a measure of stability); and (3) a guideline-fidelity index derived from panel review of a stratified sample of explanations. Wilcoxon and Friedman tests compared models within difficulty tiers; Wilson confidence intervals (CIs) summarised proportions. All analyses are hypothesis-generating.
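The endpoint definitions above can be made concrete with a minimal sketch. This is an illustration, not the authors' analysis code: the `ItemResult` record, the convention that the first entry of `answers` is the primary run, the choice to compute HAR over all items, and the 95% level (`z = 1.96`) are all assumptions introduced here.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class ItemResult:
    """Hypothetical per-vignette record; field names are illustrative."""
    correct_option: str
    unsafe_options: frozenset  # options the panel adjudicated unsafe (may be empty)
    answers: tuple             # answer per run; answers[0] is assumed to be the primary run


def safety_concordant_accuracy(items):
    """(correct, eligible) over items with at least one unsafe alternative."""
    eligible = [it for it in items if it.unsafe_options]
    correct = sum(it.answers[0] == it.correct_option for it in eligible)
    return correct, len(eligible)


def harm_avoidance_rate(items):
    """(avoided, total): primary answer is not an unsafe option, right or wrong.

    Assumption: the denominator is all items, not only those with unsafe options.
    """
    avoided = sum(it.answers[0] not in it.unsafe_options for it in items)
    return avoided, len(items)


def flip_rate(items):
    """(flipped, total): items whose answer differed across any of the runs."""
    flipped = sum(len(set(it.answers)) > 1 for it in items)
    return flipped, len(items)


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy data, invented purely to exercise the functions.
toy_items = [
    ItemResult("A", frozenset({"C"}), ("A", "A", "A")),
    ItemResult("B", frozenset({"D"}), ("D", "B", "D")),
    ItemResult("C", frozenset(), ("C", "C", "C")),
    ItemResult("A", frozenset({"B"}), ("C", "C", "A")),
]
```

On the toy data, the fourth item is answered incorrectly yet still counts towards HAR because the chosen option was not adjudicated unsafe, which is the distinction between SCA and HAR drawn above.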

Results

Safety-weighted endpoints reshaped model separation at higher difficulty. The models with the highest overall accuracy (GPT-4.0, GPT-4o) also delivered the highest SCA and HAR, particularly on Sets C–D, where unsafe distractors were frequent. The mid-tier model (GPT-o1) narrowed the gap to GPT-4o on SCA despite lower raw accuracy, driven by better avoidance of sentinel unsafe options. The lower-performing models (GPT-3.5, Bard) showed occasional breaches on sentinel scenarios. Stability differed materially: GPT-4.0 and GPT-4o showed the lowest flip-rates across re-runs, whereas Bard and GPT-3.5 exhibited higher flip-rates, with inconsistent justifications on repeat queries. Majority-vote "consensus" across re-runs modestly improved correctness on difficult items but did not fully eliminate unsafe picks. Panel-scored guideline fidelity in explanations tracked model rank at board-level difficulty: higher-ranked models more often articulated recommendation class and level of evidence and stepwise therapy, whereas lower-ranked models mixed intensity classes or skipped prerequisite steps.
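A brief sketch of how a majority-vote consensus over re-runs might be formed, and why it need not remove unsafe picks. The function names and the choice to return `None` on a tie are assumptions made for illustration; the abstract does not describe how ties were handled.

```python
from collections import Counter


def majority_vote(answers):
    """Most frequent answer across runs, or None on a tie (assumed tie rule)."""
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(answers)
    top_count = max(counts.values())
    leaders = [option for option, c in counts.items() if c == top_count]
    return leaders[0] if len(leaders) == 1 else None


def consensus_is_unsafe(answers, unsafe_options):
    """True when the consensus answer is itself an adjudicated-unsafe option."""
    consensus = majority_vote(answers)
    return consensus is not None and consensus in unsafe_options


# Invented example: the unsafe option "D" wins the vote in 3 of 5 runs,
# so the consensus answer is still unsafe.
runs = ["D", "B", "D", "D", "A"]
```

Voting only suppresses unsafe answers that appear in a minority of runs; when a model selects an unsafe option consistently, the consensus inherits it.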

Conclusions

In cardiology vignettes, safety-concordant accuracy, harm avoidance, and stability provide a stricter, more clinically meaningful lens than accuracy alone. GPT-4-class models led across this safety composite, but unsafe selections on sentinel scenarios were not eliminated, underscoring the need for guardrails. Future work should integrate real-time guideline snippets and calibrate abstention when the recommendation class or level of evidence is uncertain.
