Evaluating artificial intelligence-based cardiovascular magnetic resonance segmentation methods: Comparability to manual segmentation and impact on effect size and sample size in the TRED-HF trial
A H Laila, L Askarova, V Ahluwalia, B P Halliday
Abstract
Introduction
Centres across the United Kingdom use different cardiovascular magnetic resonance (CMR) segmentation methods to contour the heart, which yields differing measurements of ventricular volumes and function. The TRED-HF trial evaluated the withdrawal of heart failure medication in patients with dilated cardiomyopathy, using manual segmentation with the anatomical with papillary method. Artificial intelligence (AI) is increasingly used to assist segmentation. However, AI may demonstrate greater variability than manual segmentation and could affect trial outcomes and key parameters in trial design.
Purpose
Our study aims to determine which AI-based segmentation method is the most comparable to the manual method used in the TRED-HF trial and to evaluate how different segmentation methods influence effect size and sample size calculations.
Methods
This retrospective methodological study was conducted from 1 June to 20 July 2025 at a specialist heart and lung centre in the United Kingdom. CMR scans were reanalysed using four AI-based segmentation methods: anatomical with papillary (A+P), anatomical without papillary (A-P), smooth with papillary (S+P), and smooth without papillary (S-P). Thirteen ventricular metrics were compared between each AI-based method and the manual method. Agreement and bias were assessed using intraclass correlation coefficients (ICC) and Bland–Altman analysis. Analysis of covariance was used to assess the effect size of treatment withdrawal on left ventricular ejection fraction (LVEF). The required sample size was calculated using the standard deviation of LVEF differences between baseline and follow-up scans in the continued treatment arm.
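As an illustration of the bias assessment described above, the following is a minimal sketch of a Bland–Altman calculation for paired measurements. It assumes the conventional definition (bias is the mean of the paired differences, limits of agreement are bias ± 1.96 standard deviations); the abstract does not state which software or variant was used. The function name `bland_altman` and the LVEF values are hypothetical and are not TRED-HF data.

```python
from statistics import mean, stdev


def bland_altman(ai_values, manual_values):
    """Return (bias, lower limit, upper limit) for paired measurements.

    Bias is the mean of (AI - manual); the limits of agreement are
    bias +/- 1.96 * SD of the paired differences.
    """
    if len(ai_values) != len(manual_values):
        raise ValueError("paired measurements must have equal length")
    diffs = [a - m for a, m in zip(ai_values, manual_values)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Made-up LVEF values (%) for illustration only, not trial data.
ai_lvef = [52.0, 48.0, 55.0, 60.0, 45.0]
manual_lvef = [50.0, 49.0, 53.0, 57.0, 46.0]

bias, lower, upper = bland_altman(ai_lvef, manual_lvef)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

A bias close to zero with narrow limits of agreement would indicate that an AI-based method tracks the manual method closely for that metric.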
Results
A total of 120 TRED-HF CMR scans were reanalysed. Among the AI-based methods, the A+P method was the most comparable to the manual method, achieving good-to-excellent agreement in 8 of 13 metrics, the highest ICC for LVEF (0.74, 95% CI: 0.62 to 0.83), and the smallest bias for 7 of 13 metrics. Segmentation method affected the effect size of treatment withdrawal. The manual method demonstrated the largest effect size (-8.77%), followed by the A-P method (-7.91%). Meanwhile, the S-P method demonstrated the smallest effect size (-7.53%). To achieve 80% statistical power to detect a 5% difference in LVEF between two groups, the manual method required the smallest sample size (38 participants), followed by the A-P method (42 participants). Meanwhile, the S-P method required the largest sample size (48 participants).
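To show how the standard deviation of LVEF change drives the required sample size, the sketch below uses the common normal-approximation formula for comparing two means, n per group = 2 (z_{1-α/2} + z_{power})² σ² / Δ². This is an assumption: the abstract does not state the formula, the significance level, or whether the reported figures are per group or in total. The function name `n_per_group` and the standard deviations are hypothetical placeholders, and the sketch does not attempt to reproduce the reported sample sizes.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sd, delta, alpha=0.05, power=0.80):
    """Participants per group to detect a mean difference `delta`.

    Normal approximation for a two-sided, two-sample comparison of means
    with a common standard deviation `sd`.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Placeholder SDs of LVEF change (%), chosen only to show the trend.
for sd in (5.0, 5.5, 6.0):
    print(f"SD={sd}: {n_per_group(sd, delta=5.0)} per group")
```

Because the required sample size scales with the square of the standard deviation, a segmentation method that adds even modest measurement variability increases the number of participants needed for the same power.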
Conclusions
The choice of segmentation method affected the effect size of treatment withdrawal and the required sample size in the trial and is therefore an important consideration when reviewing CMR-based research. AI-based segmentation has the potential to be used in clinical trials, but manual segmentation remains the most favourable method because it requires the smallest sample size to achieve the desired statistical power.
Figure: Agreement between AI and manual methods
Figure: Effect size and sample size across methods