AI vs AI: performance of two large language models for ECG-based emergent cath lab activation in acute coronary syndrome

DOI: 10.1093/ejhf/xuag193.1437 ISSN: 1388-9842

M Rocha, P Palma, H Moreira, J Goncalves, E Oliveira, B Cruz, B Viana, E Figueiredo, L Alves, T Branco, A Pinho, R Rodrigues

Abstract

Introduction/Background

Generative artificial intelligence (AI) models are increasingly available and used for clinical decision support, but their performance in real-world ECG-based triage of acute coronary syndromes (ACS) remains uncertain.

Purpose

To evaluate and compare the performance of two large language models (LLMs), ChatGPT and Perplexity, for emergent cath lab activation decisions based solely on 12-lead ECGs.

Methods

We retrospectively analysed patients from a Portuguese hospital with emergent cath lab activation between January 2024 and December 2025 for suspected ACS. Demographic data, risk factors, culprit artery, left ventricular ejection fraction (LVEF) and right ventricular function (RVF) were collected. Each ECG was presented, with a standardised prompt, to ChatGPT and Perplexity, and the models were asked to: (i) decide on cath lab activation, (ii) provide an ECG diagnosis (e.g. anterior STEMI), (iii) infer the culprit artery, (iv) estimate LVEF as reduced vs non-reduced (<40% vs ≥40%), and (v) classify RVF. A cardiologist-adjudicated reference defined whether cath lab activation had been appropriate and whether AI outputs were correct. Accuracy was compared using McNemar’s test.

Results

We included 129 patients. Mean age was 63.5±13.8 years; ≈77% were male. Hypertension was present in 58.9%, dyslipidaemia in 57.4% and 44.2% were smokers. Median LVEF was 45% (IQR 34–55). The clinician’s activation decision was correct in 98.4% of patients. ChatGPT correctly classified cath lab activation in 40.3% and Perplexity in 52.7% of cases (Perplexity vs ChatGPT p=0.027; both p<0.001 vs clinician). ChatGPT recommended activation in 34.1% and Perplexity in 47.3% of ECGs. Despite an indication for activation in 127 patients, ChatGPT failed to activate the cath lab when it should have in 56.6% of cases and Perplexity in 44.2%. Correct ECG diagnosis was achieved in 13.2% by ChatGPT and 21.7% by Perplexity (p=0.054), and culprit artery in 16.3% vs 24.0% (p=0.10), respectively. For reduced vs non-reduced LVEF, accuracy was 65.1% (ChatGPT) and 57.4% (Perplexity; p=0.13). RVF was classified more accurately by both LLMs, with 82.2% accuracy for ChatGPT and 75.0% for Perplexity.
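For readers unfamiliar with McNemar’s test, the paired comparison of two classifiers on the same ECGs can be sketched as below. This is a minimal illustration, not the authors’ analysis: the abstract does not say whether the exact or the chi-square variant was used, and it does not report the discordant-pair counts, so the function name `mcnemar_exact` and the counts `b` and `c` are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: cases where model A was correct and model B was wrong.
    c: cases where model A was wrong and model B was correct.
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of a difference.
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) probability of observing k or fewer in the smaller cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail and cap at 1.
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, chosen only for illustration;
# the abstract reports marginal accuracies, not these cells.
b_chatgpt_only_correct = 12
c_perplexity_only_correct = 28
p_value = mcnemar_exact(b_chatgpt_only_correct, c_perplexity_only_correct)
print(f"exact McNemar p = {p_value:.4f}")
```

Only the discordant pairs matter here, which is why the test suits a design in which both models read the same 129 ECGs: marginal accuracies alone (40.3% vs 52.7%) are not enough to reproduce the reported p-value.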

Conclusions

In this real-world cohort of emergent cath lab activations, two accessible general-purpose LLMs showed poor to modest performance for ECG-based triage, with Perplexity outperforming ChatGPT but both substantially inferior to expert clinicians and, most importantly, markedly prone to under-activation. Thus, these LLMs should not be used for emergent ACS triage.
