Blinded, bias‐controlled multi‐rater evaluation of human‐versus‐AI brain metastasis segmentation using a hybrid foundation‐model framework
Yiding Han, Enze Zhu, Mhd Hasan AI Mekdash, Omar Awad, Piyush Pathak, Shixiao Liang, Daniel Allan Hamstra, Xizhe Zhang, Zaid Ali Siddiqui, Baozhou Sun
Abstract
Background
Accurate segmentation of brain metastases (BM) is essential for diagnosis, stereotactic radiosurgery planning, and longitudinal assessment. However, manual contouring is time‐intensive, limiting clinical scalability, and exhibits substantial inter‐observer variability. This variability complicates objective assessment of automated segmentation methods and challenges interpretation of model performance.
Purpose
To address these limitations, we developed TUM‐SAM, a hybrid foundation‐model framework for fully automated BM segmentation, and introduced a bias‐controlled, blinded multi‐rater evaluation paradigm to determine whether AI‐based BM segmentation has reached expert‐level performance and whether AI‐generated contours are preferred by human experts under unbiased assessment.
Methods
TUM‐SAM integrates nnU‐Net‐based lesion detection with a tumor‐adapted Med‐SAM segmentation model to enable prompt‐free, fully automated segmentation. Training used 301 patients (2548 lesions), and external evaluation used an independent cohort of 105 patients (397 lesions). Segmentation accuracy was benchmarked against DeepMedic and nnU‐Net using Dice similarity coefficient (DSC) and 95th‐percentile Hausdorff distance (HD95). Two physicians contoured all external cases, and a third physician contoured a 20‐patient subset for a blinded, tumor‐level, multi‐rater preference study. Pairwise contour preferences were analyzed using a Bradley–Terry probabilistic model to obtain bias‐adjusted estimates of relative contour quality while accounting for rater‐specific tendencies and case difficulty.
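The two geometric metrics named above can be illustrated with a minimal sketch. It is not the authors' evaluation code: it assumes each contour is given as a set of integer voxel coordinates, takes HD95 as the 95th percentile of the pooled surface-to-surface distances in both directions (other conventions exist), and uses a brute-force nearest-neighbour search that is only practical for small lesions. The function names and the `spacing` parameter are hypothetical.

```python
import math

# 6-connected neighbourhood used to decide whether a voxel lies on the surface.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dice(a, b):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))


def surface(mask):
    """Voxels of `mask` that have at least one 6-neighbour outside the mask."""
    return {
        v for v in mask
        if any((v[0] + dx, v[1] + dy, v[2] + dz) not in mask for dx, dy, dz in NEIGHBOURS)
    }


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def directed_distances(src, dst, spacing):
    """Distance in mm from each point of `src` to its nearest point in `dst`."""
    def to_mm(v):
        return [c * s for c, s in zip(v, spacing)]
    dst_mm = [to_mm(q) for q in dst]
    return [min(math.dist(to_mm(p), q) for q in dst_mm) for p in src]


def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile Hausdorff distance between two non-empty masks."""
    sa, sb = surface(a), surface(b)
    pooled = directed_distances(sa, sb, spacing) + directed_distances(sb, sa, spacing)
    return percentile(pooled, 95)


# Toy example: two 4x4x4 cubes offset by one voxel along x.
cube_a = {(x, y, z) for x in range(4) for y in range(4) for z in range(4)}
cube_b = {(x + 1, y, z) for (x, y, z) in cube_a}
print(dice(cube_a, cube_b), hd95(cube_a, cube_b))
```

Because both metrics are computed against a chosen reference contour, the same automated contour can score differently depending on which physician's contour (or which consensus) is used as that reference.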
Results
In the external cohort, TUM‐SAM achieved a lesion‐wise detection sensitivity of 0.94 and outperformed DeepMedic and nnU‐Net across all tumor sizes, with a mean DSC of 0.84 and HD95 of 1.9 mm (nnU‐Net/DeepMedic: DSC < 0.70, HD95 > 3.3 mm). In voxel‐wise evaluation, TUM‐SAM's geometric performance fell within the range of inter‐observer variability among physicians and was sensitive to how the reference contour was constructed. In the blinded rater study, by contrast, experts preferred TUM‐SAM–generated contours over individual physician contours in 81–87% of raw comparisons; Bradley–Terry analysis yielded more conservative, bias‐corrected win probabilities of 55–56%, indicating a consistent preference after adjustment for rater tendencies and case difficulty.
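To make the step from pairwise preferences to a win probability concrete, the following sketch fits a plain Bradley–Terry model with the standard minorization-maximization updates. It is a simplification under stated assumptions: it omits the rater-specific and case-difficulty adjustments described in the Methods, the win counts are invented toy numbers rather than study data, and the function names are hypothetical.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of comparisons in which contour source i was
    preferred over source j. Assumes every source wins at least once.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = n / sum(new_p)  # fix the arbitrary overall scale
        p = [x * scale for x in new_p]
    return p


def win_probability(p, i, j):
    """Model probability that source i is preferred over source j."""
    return p[i] / (p[i] + p[j])


# Toy counts (invented): source 0 preferred 3 times, source 1 preferred once.
toy_wins = [[0, 3],
            [1, 0]]
strengths = fit_bradley_terry(toy_wins)
print(win_probability(strengths, 0, 1))
```

With only two sources and no covariates, the fitted win probability equals the raw win fraction. Adjusted estimates can differ from raw percentages only once additional terms, such as rater and case effects, enter the model.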
Conclusion
Using a bias‐controlled, blinded multi‐rater evaluation framework, TUM‐SAM demonstrates brain metastasis segmentation quality that is consistently preferred by expert physicians, highlighting the limitations of agreement‐based voxel‐wise metrics under inter‐observer variability. These findings underscore the dependence of conventional evaluation on reference definition and support preference‐based assessment as a complementary approach for evaluating AI segmentation quality in BM MRI.