Data-Efficient and Explainable Multimodal Survival Prediction in NSCLC Using Deep Image Embeddings, Clinical Variables, and Gradient-Boosted Trees

DOI: 10.3390/diagnostics16121941 ISSN: 2075-4418

Sevim Sahin, Adil Gursel Karacor

Background/Objectives: Survival prediction in non-small cell lung cancer (NSCLC) remains challenging, particularly in limited-sample settings where end-to-end deep learning models may suffer from limited generalization. This study aimed to develop a data-efficient, multimodal, and explainable framework integrating computed tomography (CT)-derived imaging information with clinical variables for NSCLC survival prediction.
Methods: CT images, tumor segmentations, and clinical data from the publicly available NSCLC Radiomics (LUNG1) dataset (377 patients) were used. Tumor-focused regions were extracted using segmentation masks, and pretrained RadImageNet-InceptionV3 embeddings were obtained from the largest tumor-containing slice and neighboring-slice summaries. Deep imaging embeddings, engineered imaging features, and clinical variables were fused into a unified tabular representation. To improve robustness under limited-sample conditions, feature blocks were compressed using principal component analysis. CatBoost, XGBoost, and LightGBM models were trained on a development set and evaluated on a strictly held-out final validation set.
Results: In three-class survival stratification, assigning censored/non-event patients to the upper survival group produced the strongest ordinal prognostic performance. Under the EX_PLUS_NON_EX_TOP setting, CatBoost achieved the best holdout score-based class C-index of 0.655. In continuous survival regression, LightGBM achieved the best holdout event-patient C-index of 0.576. Clinical variables provided the dominant prognostic signal, while compact deep image embeddings contributed complementary information, particularly in separating short- and long-survival groups. SHAP analysis confirmed contributions from both clinical and image-derived features.
Conclusions: The proposed framework provides a proof-of-concept demonstration of a data-efficient and explainable image-to-tabular approach for NSCLC survival prediction under strict internal holdout validation. The results suggest that pretrained CT embeddings, clinical variables, gradient-boosted trees, and SHAP-based interpretation can be combined in a feasible, limited-sample survival modeling pipeline, while external validation remains necessary before clinical translation.
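
The abstract reports an event-patient C-index without defining it. As a rough illustration only, the sketch below assumes it means a Harrell-style concordance index computed over pairs of patients who both have an observed event, with a higher predicted score meaning longer predicted survival. The function name `event_only_c_index` and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def event_only_c_index(times, events, scores):
    """Concordance restricted to patients with an observed event.

    Assumption (not from the paper): a higher score means longer
    predicted survival, and censored patients are dropped entirely.
    """
    # Keep only (time, score) pairs for patients whose event was observed.
    observed = [(t, s) for t, e, s in zip(times, events, scores) if e]

    concordant = 0.0
    comparable = 0
    for (t_i, s_i), (t_j, s_j) in combinations(observed, 2):
        if t_i == t_j:
            continue  # tied survival times carry no ordering information
        comparable += 1
        if s_i == s_j:
            concordant += 0.5  # tied predictions count as half
        elif (t_i < t_j) == (s_i < s_j):
            concordant += 1.0  # predicted order matches observed order

    if comparable == 0:
        raise ValueError("no comparable event pairs")
    return concordant / comparable


# Toy data: survival time in months, event flag (1 = death observed),
# and a model's predicted survival score.
times = [5, 12, 20, 30, 8]
events = [1, 1, 1, 0, 1]
scores = [4.0, 10.0, 9.0, 40.0, 6.0]

# The censored patient (30 months) is ignored; 5 of the 6 remaining
# pairs are ordered correctly, so the result is 5/6.
print(round(event_only_c_index(times, events, scores), 3))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which puts the reported holdout values of 0.576 and 0.655 in context as modest but above-chance discrimination.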
