Blinded, bias‐controlled multi‐rater evaluation of human‐versus‐AI brain metastasis segmentation using a hybrid foundation‐model framework

DOI: 10.1002/mp.70538 ISSN: 0094-2405

Yiding Han, Enze Zhu, Mhd Hasan Al Mekdash, Omar Awad, Piyush Pathak, Shixiao Liang, Daniel Allan Hamstra, Xizhe Zhang, Zaid Ali Siddiqui, Baozhou Sun

Abstract

Background

Accurate segmentation of brain metastases (BM) is essential for diagnosis, stereotactic radiosurgery planning, and longitudinal assessment. However, manual contouring is time‐intensive, limiting clinical scalability, and exhibits substantial inter‐observer variability. This variability complicates objective assessment of automated segmentation methods and challenges interpretation of model performance.

Purpose

To address these limitations, we developed TUM‐SAM, a hybrid foundation‐model framework for fully automated BM segmentation, and introduced a bias‐controlled, blinded multi‐rater evaluation paradigm to determine whether AI‐based BM segmentation has reached expert‐level performance and whether AI‐generated contours are preferred by human experts under unbiased assessment.

Methods

TUM‐SAM integrates nnU‐Net‐based lesion detection with a tumor‐adapted Med‐SAM segmentation model to enable prompt‐free, fully automated segmentation. Training used 301 patients (2548 lesions), and external evaluation used an independent cohort of 105 patients (397 lesions). Segmentation accuracy was benchmarked against DeepMedic and nnU‐Net using Dice similarity coefficient (DSC) and 95th‐percentile Hausdorff distance (HD95). Two physicians contoured all external cases, and a third physician contoured a 20‐patient subset for a blinded, tumor‐level, multi‐rater preference study. Pairwise contour preferences were analyzed using a Bradley–Terry probabilistic model to obtain bias‐adjusted estimates of relative contour quality while accounting for rater‐specific tendencies and case difficulty.
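The two geometric metrics named above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the representation of a mask as a set of integer voxel coordinates, the 6-neighbour surface definition, and the choice to take the larger of the two directed 95th percentiles are all assumptions made for illustration, and published HD95 implementations differ on these points.

```python
import math

def dice(a, b):
    """Dice similarity coefficient between two masks given as sets of voxels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(a & b) / (len(a) + len(b))

def surface(mask):
    """Voxels of `mask` with at least one 6-connected neighbour outside it."""
    mask = set(mask)
    nbrs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [p for p in mask
            if any((p[0] + dx, p[1] + dy, p[2] + dz) not in mask
                   for dx, dy, dz in nbrs)]

def _directed(src, dst, spacing):
    """Distance in mm from each point of src to its nearest point of dst."""
    return [min(math.sqrt(sum(((pi - qi) * s) ** 2
                              for pi, qi, s in zip(p, q, spacing)))
                for q in dst)
            for p in src]

def _percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    v = sorted(values)
    pos = (len(v) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return v[lo] + (v[hi] - v[lo]) * (pos - lo)

def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile Hausdorff distance between mask surfaces, in mm.

    Brute force, so only suitable for small masks such as single lesions.
    """
    sa, sb = surface(a), surface(b)
    if not sa or not sb:
        raise ValueError("HD95 is undefined when either mask is empty")
    return max(_percentile(_directed(sa, sb, spacing), 95),
               _percentile(_directed(sb, sa, spacing), 95))

# Hypothetical example: a 3x3x3 cube and the same cube shifted one voxel in x.
cube = {(x, y, z) for x in range(3) for y in range(3) for z in range(3)}
shifted = {(x + 1, y, z) for (x, y, z) in cube}
print(dice(cube, shifted), hd95(cube, shifted))
```

Because both metrics are computed against a reference contour, their values change when the reference changes, for example when a different physician's contour is used.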

Results

In the external cohort, TUM‐SAM achieved a lesion‐wise detection sensitivity of 0.94 and outperformed DeepMedic and nnU‐Net across all tumor sizes, with a mean DSC of 0.84 and HD95 of 1.9 mm (nnU‐Net/DeepMedic: DSC < 0.70, HD95 > 3.3 mm). In voxel‐wise evaluation, TUM‐SAM's geometric performance fell within the range of inter‐observer variability among physicians and was sensitive to how the reference contour was constructed. In contrast, in the blinded rater study, experts preferred TUM‐SAM–generated contours over individual physician contours in 81–87% of raw comparisons; Bradley–Terry analysis yielded more conservative, bias‐corrected win probabilities of 55–56%, indicating that the preference persisted after adjustment for rater tendencies and case difficulty.
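To make the link between raw comparison counts and a Bradley–Terry win probability concrete, the sketch below fits a plain Bradley–Terry model with the standard minorization-maximization update. It is an illustration under stated assumptions, not the study's analysis: the model described in the Methods also accounts for rater-specific tendencies and case difficulty, which this sketch omits, so it would not reproduce the adjusted 55–56% figures. The function names and the win counts are hypothetical.

```python
def fit_bradley_terry(wins, iters=500, tol=1e-12):
    """Fit plain Bradley-Terry strengths from a square matrix of win counts.

    wins[i][j] is the number of comparisons in which item i was preferred
    over item j. Returns strengths normalized to sum to 1. Assumes every
    item has at least one win and one loss; otherwise the maximum-likelihood
    estimate does not exist and the iteration drifts toward 0 or 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new)
        new = [x / scale for x in new]
        converged = max(abs(x - y) for x, y in zip(new, p)) < tol
        p = new
        if converged:
            break
    return p

def win_probability(p, i, j):
    """Model probability that item i is preferred over item j."""
    return p[i] / (p[i] + p[j])

# Hypothetical counts: item 0 preferred over item 1 in 3 of 4 comparisons.
strengths = fit_bradley_terry([[0, 3], [1, 0]])
print(win_probability(strengths, 0, 1))
```

With only two items and no covariates, the fitted win probability equals the raw win fraction. A gap between raw and adjusted figures, as reported above, can only arise once additional terms such as rater and case effects enter the model.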

Conclusion

Using a bias‐controlled, blinded multi‐rater evaluation framework, TUM‐SAM demonstrates brain metastasis segmentation quality that is consistently preferred by expert physicians, highlighting the limitations of agreement‐based voxel‐wise metrics under inter‐observer variability. These findings underscore the dependence of conventional evaluation on reference definition and support preference‐based assessment as a complementary approach for evaluating AI segmentation quality in BM MRI.
