DOI: 10.1093/europace/euag105.1228 ISSN: 1099-5129

Atrial/AV parsing and QRS govern clinical usefulness of image-native 12-lead ECG interpretation by a compact multimodal reasoning model

I Zeljkovic, A Novak, T Cvetko, A Jordan, N Pavlovic, S Manola

Abstract

Background

Large language models with native vision are beginning to read 12-lead ECGs, yet headline accuracy alone is a poor guide for clinical deployment. We asked whether a compact multimodal reasoning model can deliver clinically useful, element-level interpretations directly from ECG images, and which visual elements most strongly govern usefulness.

Methods

We evaluated OpenAI o4-mini on 120 real-world ECG images from the CaRD registry under a leakage-controlled protocol (vignettes without rhythm labels, explicit rates/regularity, device mentions, or biomarker values). Two blinded cardiologists scored seven binary elements (rate/rhythm, P-wave, PR, QRS, axis, ST, T), which were summed into a Diagnostic Score (0–7; primary endpoint), and rated differential-diagnosis quality and recommendation usefulness (0–2). Statistics included Kruskal–Wallis/Mann–Whitney tests with Benjamini–Hochberg false discovery rate (BH-FDR) correction, Wilson confidence intervals, Spearman correlations, and proportional-odds models for usefulness. A prespecified composite, RhythmParsing3 (P-wave, PR, and rate/rhythm all correct), was tested as a deployment gate. Inter-rater agreement was assessed with Cohen’s κ and quadratic-weighted κ.
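The scoring scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `ElementScores` class, its field names, and the helper functions are hypothetical, and it assumes a single agreed binary judgement per element for each ECG.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class ElementScores:
    """Binary judgement (True = correct) for each of the seven scored elements.

    Field names are illustrative; the abstract only names the elements.
    """
    rate_rhythm: bool
    p_wave: bool
    pr: bool
    qrs: bool
    axis: bool
    st: bool
    t: bool


def diagnostic_score(scores: ElementScores) -> int:
    """Diagnostic Score: number of correct elements, ranging from 0 to 7."""
    return sum(astuple(scores))


def rhythm_parsing3(scores: ElementScores) -> bool:
    """Composite: P-wave, PR, and rate/rhythm must all be correct."""
    return scores.p_wave and scores.pr and scores.rate_rhythm


# Hypothetical case: QRS and ST judged incorrect, everything else correct.
example = ElementScores(
    rate_rhythm=True, p_wave=True, pr=True,
    qrs=False, axis=True, st=False, t=True,
)
print(diagnostic_score(example), rhythm_parsing3(example))
```

In this hypothetical case the composite is satisfied even though the Diagnostic Score is below 7, which is why the composite and the total score carry different information.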

Results

The 120 cases comprised 42 arrhythmia, 21 conduction/pacing, 21 ACS, and 36 other. The mean Diagnostic Score was 4.64 ± 1.98 (median 5). Element accuracies (%) were: axis 75.0, P-wave 69.2, PR 69.2, QRS 67.5, rate/rhythm 65.0, T 60.8, ST 57.5. Scores differed by category (Kruskal–Wallis p<0.001), driven by arrhythmia vs other. Recommendation usefulness correlated with the Diagnostic Score (ρ=0.69) and with differential-diagnosis quality (ρ=0.68), both q<0.001. In adjusted ordinal models, P-wave correctness (OR 27.9, 95% CI 1.18–655; p=0.039), QRS correctness (OR 16.0, 95% CI 4.48–56.9; p<0.001) and differential-diagnosis quality (OR 13.1, 95% CI 5.11–33.3; p<0.001) independently predicted higher usefulness. Case mix mattered: ACS vs arrhythmia showed OR 4.89 (95% CI 1.77–13.47; p=0.002) in a category-only model and OR 25.5 (95% CI 4.91–132.7; p<0.001) in a reduced adjusted model using RhythmParsing3 and QRS. Internal consistency of the 7-item score was Cronbach’s α=0.70; inter-rater agreement was κ=0.74 (binary) and 0.71 (ordinal). A post-hoc deployment analysis indicated that requiring RhythmParsing3 plus correct QRS functions as a pragmatic readiness gate for actionable recommendations, aligning usefulness with concrete, checkable visual primitives (atrial activity/AV relations and depolarization morphology).
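As a worked check on the element accuracies, the sketch below computes Wilson score intervals and confirms that the seven accuracies sum to the reported mean Diagnostic Score. The per-element counts are not given in the abstract: they are back-calculated here from the reported percentages, assuming each percentage is taken over all 120 cases. The function and variable names are illustrative.

```python
import math

N_CASES = 120

# Assumed counts of correct interpretations, back-calculated from the
# reported percentages (e.g. 75.0% of 120 = 90). Not stated in the abstract.
ELEMENT_COUNTS = {
    "axis": 90,
    "P-wave": 83,
    "PR": 83,
    "QRS": 81,
    "rate/rhythm": 78,
    "T": 73,
    "ST": 69,
}


def wilson_ci(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default: 95%)."""
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


for element, correct in ELEMENT_COUNTS.items():
    low, high = wilson_ci(correct, N_CASES)
    print(f"{element:12s} {100 * correct / N_CASES:5.1f}%  "
          f"95% CI {100 * low:.1f} to {100 * high:.1f}")

# The mean of a sum of binary items equals the sum of the item accuracies,
# so the counts should reproduce the reported mean Diagnostic Score.
mean_score = sum(ELEMENT_COUNTS.values()) / N_CASES
print(f"Implied mean Diagnostic Score: {mean_score:.2f}")
```

Under these assumed counts the implied mean matches the reported 4.64, which suggests the element accuracies and the Diagnostic Score are internally consistent.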

Conclusions

A compact multimodal reasoning model achieved moderate, category-dependent performance on image-native ECGs. Clinical usefulness is governed by atrial/AV parsing and QRS characterization, suggesting a mechanistic, safety-aware route to deployment: gate model advice on RhythmParsing3 ∧ QRS; surface uncertainty or abstain when the gate fails; and focus engineering effort on numeric interval extraction, pacing-spike detection, and reciprocal-lead checks for ischemia. Until such refinements are in place, the model's role is triage and decision support, not definitive interpretation.
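The proposed gate (RhythmParsing3 ∧ QRS, with abstention on failure) can be expressed as a small decision function. This is a sketch under stated assumptions: the abstract does not say how element correctness would be established at deployment time, so the four flags are assumed to come from some external verification step such as a clinician over-read. The `Action` enum and `readiness_gate` function are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    SHOW_ADVICE = "show_advice"  # gate passed: surface the recommendation
    ABSTAIN = "abstain"          # gate failed: surface uncertainty instead


def readiness_gate(
    p_wave_ok: bool,
    pr_ok: bool,
    rate_rhythm_ok: bool,
    qrs_ok: bool,
) -> Action:
    """Release model advice only when RhythmParsing3 AND QRS are verified.

    The four flags are assumed inputs from an external check; how they are
    obtained in practice is outside the scope of this sketch.
    """
    rhythm_parsing3 = p_wave_ok and pr_ok and rate_rhythm_ok
    if rhythm_parsing3 and qrs_ok:
        return Action.SHOW_ADVICE
    return Action.ABSTAIN


# Hypothetical cases: one passes the gate, one fails on the PR interval.
print(readiness_gate(True, True, True, True).value)
print(readiness_gate(True, False, True, True).value)
```

Failing any single component leads to abstention, which mirrors the conjunctive form of the gate; axis, ST, and T correctness do not enter the decision.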
