DOI: 10.3390/jcm15124846 ISSN: 2077-0383

Hybrid Metaheuristic Feature Selection for Breast Cancer Detection in Digital Mammography: A Feasibility Study with Nested Validation, Benchmarking, and External Stress Testing

Bandar S. Alshreef, Yousif A. Kariri

Background/Objectives: The “small-n-large-p” dilemma in mammography artificial intelligence (AI), in which the number of candidate imaging features far exceeds the number of labeled cases, commonly results in model overfitting, unstable feature selection, and poor generalization across clinical settings. This study reassesses the internal performance of the HiTopology-GOA-CSA (Grasshopper Optimization Algorithm–Crow Search Algorithm) feature-selection framework for mammography using a larger real Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) cohort and a stricter leakage-aware evaluation strategy.

Methods: In this retrospective computational study using public anonymized datasets, an expanded internal cohort of 98 CBIS-DDSM mass cases (49 benign, 49 malignant) was assembled from Digital Imaging and Communications in Medicine (DICOM) region of interest (ROI) series. A total of 1074 features were extracted per case, comprising 88 handcrafted radiomic descriptors and 986 EfficientNet-B5 deep features. HiTopology-GOA-CSA selected 102 features, corresponding to 91% feature reduction. Two internal evaluation modes were compared: Mode A, which matched the original pilot methodology by performing feature selection once on the full cohort before cross-validation, and Mode B, which used strict nested feature selection within training folds. Performance was assessed with 5-fold stratified cross-validation using a multilayer perceptron (MLP) classifier.

Results: On the expanded cohort, Mode A achieved an area under the receiver operating characteristic curve (AUC) of 0.726 (95% CI: 0.594–0.858), sensitivity of 0.658, specificity of 0.651, and F1-score of 0.644. Under the stricter nested evaluation, Mode B achieved an AUC of 0.683 (95% CI: 0.549–0.817), sensitivity of 0.598, specificity of 0.631, and F1-score of 0.595. Mean pairwise Jaccard similarity across nested folds was 0.604, indicating moderate feature stability. Benchmark comparisons showed that the proposed method was competitive but did not outperform standard baselines: LASSO logistic regression achieved the highest AUC of 0.739, while the proposed HiTopology-GOA-CSA + MLP achieved an AUC of 0.683. External validation on the locked VinDr-Mammo subset (n = 25) remained at chance level (AUC of 0.500 [95% CI: 0.304–0.696]), with complete prediction collapse: every case was assigned to the positive class (sensitivity of 1.000, specificity of 0.000).

Conclusions: The framework demonstrated feasibility for structured feature selection and stress testing in a small-cohort mammography AI setting; however, external validation revealed chance-level discrimination and prediction collapse, indicating limited generalizability. These findings emphasize the need for benchmark comparisons, transparent uncertainty reporting, patient-level validation, and larger multicenter datasets before clinical translation.
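To make the distinction between the two evaluation modes concrete, the following is a minimal sketch of nested (Mode B style) selection, in which features are chosen inside each training fold only, followed by the mean pairwise Jaccard similarity of the per-fold feature sets. It is an illustration under stated assumptions, not the study's implementation: the data are synthetic noise, the helper names (`stratified_folds`, `select_top_k`, `mean_pairwise_jaccard`) are hypothetical, and a simple class-mean-difference ranking stands in for HiTopology-GOA-CSA.

```python
import random
from itertools import combinations


def stratified_folds(labels, k, seed=0):
    """Split case indices into k folds with roughly equal class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def select_top_k(X, y, rows, k):
    """Stand-in selector (NOT HiTopology-GOA-CSA): keep the k features with
    the largest absolute class-mean difference, using only the given rows."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    scores.sort(reverse=True)
    return {j for _, j in scores[:k]}


def mean_pairwise_jaccard(sets):
    """Average |A & B| / |A | B| over all unordered pairs of feature sets."""
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Synthetic stand-in data: 98 cases (49 per class), 50 noise features.
# The cohort size mirrors the abstract; everything else is assumed.
rng = random.Random(42)
n_cases, n_features, k_select = 98, 50, 5
y = [0] * 49 + [1] * 49
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_cases)]

folds = stratified_folds(y, k=5)
selected_per_fold = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(n_cases) if i not in held_out]
    # Nested selection: the held-out fold never influences which features
    # are chosen. A classifier would then be fitted on train_idx using only
    # these features and scored on test_idx.
    selected_per_fold.append(select_top_k(X, y, train_idx, k_select))

stability = mean_pairwise_jaccard(selected_per_fold)
print(f"mean pairwise Jaccard across folds: {stability:.3f}")
```

A Mode A style evaluation would instead call the selector once on all 98 cases before the fold loop, so every held-out case would already have contributed to the choice of features; that is the leakage the nested design is intended to remove.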
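The prediction collapse reported for the external subset can be illustrated with a toy calculation. The sketch below shows one way in which a sensitivity of 1.000, a specificity of 0.000, and an AUC of 0.500 can occur together: a model that returns the same score for every case. The 12/13 class split of the 25 cases and the constant score are assumptions made purely for illustration; the abstract does not report either, and the same summary values could arise in other ways.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical 25-case external set; the class split is an assumption.
y_true = [1] * 12 + [0] * 13
# Collapsed model: identical score for every case (assumed for illustration).
scores = [0.9] * 25
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={auc(y_true, scores):.3f}")
```

Because every positive-negative pair is tied, the ranking carries no information, which is why a headline sensitivity of 1.000 in this setting does not indicate useful discrimination.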
