Artificial intelligence versus a cardiologist in electrocardiogram interpretation: a striking performance gap in 100 standard recordings
R Gonzalez Aguirre, J R Leal Diaz, J E Leal Cavazos, E Reyna Reyna, R P Alarcon Quilantan, O Santillan Garcia, M Cortes Aguirre
Abstract
Background
Large language models (LLMs) are increasingly applied in clinical reasoning, yet their diagnostic performance in standard electrocardiogram (ECG) interpretation remains uncertain.
Purpose
To compare the diagnostic accuracy of two advanced LLMs, ChatGPT-5 Plus (OpenAI) and Gemini Pro (Google), with that of a cardiologist in interpreting 12-lead ECGs corresponding to typical cardiac disorders.
Methods
A total of 100 anonymised, straightforward electrocardiograms representing 13 common diagnostic categories (normal ECGs, atrial fibrillation, flutter, supraventricular tachycardia, ST-elevation myocardial infarction, right/left bundle branch block, atrioventricular blocks, Wolff–Parkinson–White, pacemaker rhythm and fascicular blocks) were independently interpreted by ChatGPT-5 Plus, Gemini Pro and two cardiologists.
The cardiologists' interpretations were concordant in 99 of 100 cases, with a single discrepancy subsequently resolved by consensus. These interpretations were considered the reference standard.
All ECGs represented classical, unambiguous electrocardiographic patterns, ensuring diagnostic certainty.
Each interpreter provided up to two diagnoses per ECG. Paired comparisons used McNemar’s test (LLM vs cardiologist; LLM vs LLM).
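The paired comparison described above can be illustrated with a minimal sketch of McNemar's test. The abstract does not state which variant was used; the exact (binomial) form is assumed here, and the function name `mcnemar_exact` and the per-ECG correctness lists are hypothetical, not study data.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar test for two raters scored on the same items.

    a_correct, b_correct: equal-length sequences of booleans, one entry per
    ECG, True when that interpreter's diagnosis matched the reference.
    Returns (b, c, p): the two discordant counts and the p-value.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both raters must be scored on the same ECGs")

    # Only discordant pairs carry information in McNemar's test.
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0

    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical example: 14 ECGs, not data from the study.
rater_a = [True] * 8 + [False] * 1 + [True] * 3 + [False] * 2
rater_b = [False] * 8 + [True] * 1 + [True] * 3 + [False] * 2
b, c, p = mcnemar_exact(rater_a, rater_b)
print(b, c, p)
```

The test ignores ECGs on which both interpreters were right or both were wrong, which is why it suits a design where every interpreter reads the same 100 recordings.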
Results
Overall accuracy was 14.8% for ChatGPT-5 Plus and 25.4% for Gemini Pro.
LLM performance was highest in atrial fibrillation (ChatGPT-5 Plus 9%, Gemini Pro 73%), left bundle branch block (ChatGPT-5 Plus 31%, Gemini Pro 62%) and normal ECGs (ChatGPT-5 Plus 30%, Gemini Pro 80%), and poorest in atrioventricular block (ChatGPT-5 Plus 3.8%, Gemini Pro 3.8%), pacemaker rhythm (ChatGPT-5 Plus 0%, Gemini Pro 7.7%) and ST-elevation myocardial infarction (ChatGPT-5 Plus 8.3%, Gemini Pro 8.3%).
McNemar’s tests showed significant differences between each LLM and the cardiologist (p<0.001). The two LLMs also differed significantly from each other (ChatGPT-5 Plus vs Gemini Pro, McNemar p=0.014).
Conclusions
Gemini Pro outperformed ChatGPT-5 Plus. There is a striking and consistent difference between AI models and cardiologist interpretation, even with paid and updated versions of the models.
Artificial intelligence, in its current form, is not prepared for diagnostic use in electrocardiography, even in simple and typical cases. Accurate electrocardiogram interpretation remains dependent on human clinical expertise.
Diagnostic accuracy