Predicting atrial fibrillation in the real world: head-to-head comparison of five risk scores
A Estrada-Magana, J R Medina-Inojosa, M A Sheffeh, L M Ortega-Aviles, B Medina-Inojosa, S C Bhyravajosyula, A A Rabinstein, P A Friedman, I Z Attia, A J Deshmukh, P A Noseworthy, F Lopez-Jimenez
Abstract
Introduction
Atrial fibrillation (AF) is a major contributor to stroke risk, yet many cases remain undiagnosed due to its silent and paroxysmal nature. To address this, multiple risk scores have been developed to predict incident AF. This study aimed to evaluate the predictive accuracy of five widely used clinical AF prediction tools (i.e., CHARGE-AF, CHA2DS2-VASc, mC2HEST, HATCH, and HAVOC) among patients undergoing long-term continuous monitoring with implantable loop recorders (ILRs).
Methods
We conducted a retrospective cohort study across three major US academic medical centers, including patients with ILRs and no prior diagnosis or ECG evidence of AF. Clinical variables used to calculate each risk score were extracted from electronic health records. Discrimination was assessed using the area under the receiver operating characteristic curve (AUC). Kaplan-Meier survival analysis and Cox proportional hazards models were used to evaluate the association between each score’s risk categories and incident AF (defined as any episode lasting ≥6 minutes). Calibration was assessed using calibration plots and Brier scores.
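The two headline metrics named above can be illustrated with a minimal sketch. It is not the authors' analysis code: the function names and the toy data are hypothetical, each score is assumed to have already been converted to a predicted probability of AF (the abstract does not say how this was done), and censoring during follow-up is ignored. The AUC is computed as the proportion of AF/non-AF patient pairs in which the patient with AF has the higher predicted risk; the Brier score is the mean squared difference between predicted probability and observed outcome, so lower is better.

```python
def auc(scores, events):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen patient with AF has a
    higher score than a randomly chosen patient without AF (ties count 0.5).
    """
    pos = [s for s, e in zip(scores, events) if e == 1]
    neg = [s for s, e in zip(scores, events) if e == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(probs, events):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(events) or not events:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - e) ** 2 for p, e in zip(probs, events)) / len(events)


# Hypothetical toy data: four patients, two of whom developed AF.
predicted = [0.10, 0.40, 0.35, 0.80]  # assumed predicted AF probabilities
observed = [0, 0, 1, 1]               # 1 = incident AF detected

print(round(auc(predicted, observed), 3))          # 0.75
print(round(brier_score(predicted, observed), 3))  # 0.158
```

In this toy example three of the four AF/non-AF pairs are ranked correctly, which gives an AUC of 0.75. A time-to-event analysis like the one described would additionally need to account for differing follow-up lengths, which this sketch does not attempt.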
Results
A total of 3,895 patients were included (mean age 61 ± 17 years; 51% male). Incident AF was detected in 879 patients (22%) over a median follow-up of 17 months (IQR 8-35). Table 1 presents baseline characteristics and device-detected AF event data during follow-up. CHARGE-AF had the highest AUC (0.68) but also the highest Brier score (Fig. 1). Failure curves showed clear risk stratification across all tools (Fig. 2). Calibration plots (Fig. 3) demonstrated modest agreement between predicted and observed risk across all models. In pairwise comparisons of Brier scores using the Wilcoxon signed-rank test, with CHARGE-AF as the reference, CHARGE-AF showed lower calibration accuracy than CHA2DS2-VASc (p = 0.03), mC2HEST (p = 0.001), and HAVOC (p = 0.001), and did not differ significantly from HATCH (p = 0.15).
Conclusions
In patients undergoing continuous rhythm monitoring with ILRs, commonly used AF prediction scores demonstrated modest discriminatory ability and similar calibration performance. These findings suggest that existing risk models, which were largely developed in cohorts with clinically diagnosed AF, may not optimally translate to continuously monitored populations. Future work is needed to develop and validate AF prediction models specifically in cohorts undergoing long-term continuous cardiac monitoring.
Figure 1
Figure 2-3