DOI: 10.3390/s26123911 ISSN: 1424-8220

Cross-Sensor and Cross-Population Generalization of Deep Learning Models for Digital Mammography: A Controlled Four-Country Benchmark of Five Backbone Architectures with Statistical Significance Testing

Somprasonk Gabbualoy, Pattarapong Phasukkit, Supan Tungjitkusolmun

Background/Objectives: Deep learning models for digital mammography sensor data are increasingly deployed across hospitals that differ in X-ray detector technology and patient population. Whether models trained on one sensor platform and population maintain accuracy when transferred to another has not been tested for the latest generation of mammography-specific foundation models under one controlled protocol. Methods: We fine-tuned five backbone architectures (ResNet-50, DINOv2-B14, Rad-DINO, Mammo-CLIP B5, and Mammo-FM) on CBIS-DDSM (film-digitized, USA, n = 714 validation) with three seeds, ablated a density-aware focal loss across three auxiliary weights, and evaluated transfer to three external sensor cohorts: CMMD (full-field digital, China, n = 1032), DMID (mixed digital, India, n = 509), and MIAS (film-digitized, UK, n = 322). Significance was assessed with paired DeLong z-tests under Benjamini–Hochberg (BH) FDR correction; post hoc recalibration by temperature scaling was tested at all transfer targets. Results: Within this single-source three-seed evaluation, ResNet-50 outperformed all four foundation models on CBIS-DDSM (AUC 0.867 vs. 0.847, 0.846, 0.813, and 0.703; all gaps p_adj < 0.05). The density-aware focal loss degraded both AUC and calibration at every weight tested. At transfer, every model lost 0.165 to 0.320 AUC points relative to in-distribution performance, and sensitivity at 95% specificity fell from 0.31–0.47 in-distribution to 0.11–0.22 across the three external targets. A per-seed Stouffer meta-analysis indicated that, after BH-FDR correction, Mammo-CLIP B5 and Mammo-FM significantly outperformed ResNet-50 on DMID, and Mammo-CLIP B5 did so on CMMD; MIAS comparisons remained directional only. In the extremely dense subgroup (BI-RADS D4), Mammo-FM reached AUC 0.870 versus 0.842 for ResNet-50, a directional observation whose 95% CIs overlap heavily at n = 140 and which we do not interpret as a statistically supported advantage.
Conclusions: In this single-training-source, three-seed protocol, mammography-specific pretraining did not deliver the in-distribution AUC premium reported in the originating papers, and no architecture reached a level at which transfer deployment without local validation would be defensible. We frame these as observations specific to the present protocol rather than as broader conclusions about foundation models for mammography classification. The findings argue for sensor-stratified and population-stratified external validation and for local recalibration as practical prerequisites before clinical use. Code and weights are released under the MIT license.
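
The per-seed Stouffer combination and BH-FDR correction named above can be illustrated with a minimal Python sketch. It is not the released code: the function names are hypothetical, the z-scores are invented for illustration and are not results from this study, and the equal-weight combination treats seeds as independent, which is an assumption that may not match the authors' procedure.

```python
import math
from statistics import NormalDist


def stouffer_z(z_scores):
    """Combine z-scores with equal weights (Stouffer's method).

    Assumes the inputs are independent; seeds evaluated on the same
    test set are correlated, so treat the result as approximate.
    """
    return sum(z_scores) / math.sqrt(len(z_scores))


def two_sided_p(z):
    """Two-sided p-value of a standard normal z statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values are non-decreasing in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-seed DeLong z-scores for three model comparisons.
per_seed_z = {
    "model_a_vs_baseline": [2.1, 1.8, 2.4],
    "model_b_vs_baseline": [0.4, -0.2, 0.9],
    "model_c_vs_baseline": [1.2, 1.5, 0.8],
}

names = list(per_seed_z)
raw_p = [two_sided_p(stouffer_z(per_seed_z[n])) for n in names]
adj_p = benjamini_hochberg(raw_p)

for name, p, q in zip(names, raw_p, adj_p):
    print(f"{name}: p = {p:.4f}, p_adj = {q:.4f}")
```

A comparison would be reported as significant only if its adjusted p-value falls below the chosen FDR level (0.05 in the abstract); anything above that threshold corresponds to the "directional only" wording used for MIAS.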
