DOI: 10.4103/wkrj.wkrj_35_26 ISSN: 3117-9789

Optimized and Interpretable Multimodal Machine Learning for Benchmark-based Drug-Response Prediction Using the Therapeutics Data Commons Genomics of Drug Sensitivity in Cancer 2 Dataset

Albina Ondabayeva

ABSTRACT

Background:

Benchmark-based drug-response prediction is essential for reproducible pharmacogenomic machine-learning research. However, many studies insufficiently separate model-development stages, provide limited preprocessing transparency, and overstate biological interpretation from latent features.

Objective:

To develop and evaluate a reproducible multimodal machine-learning workflow for anticancer drug-response prediction using the Therapeutics Data Commons (TDC) Genomics of Drug Sensitivity in Cancer 2 (GDSC2) benchmark dataset.

Materials and Methods:

The TDC GDSC2 dataset was analyzed using the predefined benchmark split. Drug structures were encoded as 1024-bit Morgan fingerprints, while 17,737-dimensional cell-line features were standardized and reduced to 128 principal components, retaining approximately 61.8% of variance. Multimodal features were constructed by concatenating drug and cell-line representations. Ridge regression, ElasticNet, Random Forest, multilayer perceptron, XGBoost, and modality-specific baselines were compared. Hyperparameter optimization for XGBoost was performed using RandomizedSearchCV with three-fold cross-validation on the training set. The final model was retrained on combined training and validation data and evaluated once on the independent test set. Bootstrap resampling was used to estimate 95% confidence intervals. Model behavior was assessed using residual analysis, calibration-style evaluation, global feature importance, and principal component analysis (PCA) loading-based interpretability analysis.
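As a rough illustration of the feature-construction step described above, the sketch below standardizes a synthetic cell-line matrix, projects it onto 128 principal components, and concatenates the result with random 1024-bit vectors standing in for Morgan fingerprints. The helper names (`fit_scaler_pca`, `transform`, `fuse`), the NumPy SVD-based PCA, the choice to fit the scaler and PCA on training rows only, and all data are assumptions made for illustration, not the study's code. The 500-column matrix is a small stand-in for the 17,737 cell-line features.

```python
import numpy as np


def fit_scaler_pca(x_train, n_components):
    """Fit standardization and PCA on training rows only (assumed protocol)."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    z = (x_train - mean) / std
    # z is column-centered, so the rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    components = vt[:n_components]
    explained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return mean, std, components, explained


def transform(x, mean, std, components):
    """Apply the training-set scaling and projection to any split."""
    return ((x - mean) / std) @ components.T


def fuse(drug_bits, cell_pcs):
    """Concatenate drug and cell-line representations column-wise."""
    return np.hstack([drug_bits, cell_pcs])


# Synthetic stand-ins; real inputs would come from the benchmark dataset.
rng = np.random.default_rng(0)
n_train, n_test = 200, 50
n_bits, n_features, n_pcs = 1024, 500, 128

cell_train = rng.normal(size=(n_train, n_features))
cell_test = rng.normal(size=(n_test, n_features))
fp_train = rng.integers(0, 2, size=(n_train, n_bits))
fp_test = rng.integers(0, 2, size=(n_test, n_bits))

mean, std, comps, explained = fit_scaler_pca(cell_train, n_pcs)
X_train = fuse(fp_train, transform(cell_train, mean, std, comps))
X_test = fuse(fp_test, transform(cell_test, mean, std, comps))

print(X_train.shape, X_test.shape)
print(f"variance retained by {n_pcs} components: {explained:.3f}")
```

Reusing the training-set mean, scale, and components for the test rows keeps test information out of the preprocessing, which is one common way to keep model-development stages separate.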

Results:

The fusion XGBoost model achieved the strongest validation performance (root mean squared error [RMSE] = 1.0517, mean absolute error [MAE] = 0.7808, R² = 0.8485). On the independent test set, the final retrained model obtained RMSE = 1.0085, MAE = 0.7609, and R² = 0.8620. Bootstrap analysis showed stable estimates, with 95% confidence intervals of 0.9938–1.0227 for RMSE, 0.7512–0.7701 for MAE, and 0.8574–0.8667 for R². Feature-importance analysis indicated that PCA-derived cell-line components contributed substantially to prediction, while PCA loading back-projection provided a partial link between latent cell-line components and the original feature space. Residual and calibration analyses showed no major systematic bias, although prediction error was higher in extreme response ranges.
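Confidence intervals of the kind reported above can be obtained in several ways. The sketch below shows one plausible variant, a percentile bootstrap over paired test-set observations and predictions, using only the Python standard library. The abstract does not state which bootstrap variant, how many resamples, or which resampling unit was used, so the function `bootstrap_ci`, its defaults, and the synthetic data are assumptions for illustration only.

```python
import math
import random


def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))


def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)


def r2(y, p):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def bootstrap_ci(y_true, y_pred, metric, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling (truth, prediction) pairs."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lo, hi


# Synthetic responses and noisy predictions, not the study's data.
data_rng = random.Random(1)
y_true = [data_rng.gauss(0.0, 2.7) for _ in range(300)]
y_pred = [y + data_rng.gauss(0.0, 1.0) for y in y_true]

rmse_lo, rmse_hi = bootstrap_ci(y_true, y_pred, rmse)
mae_lo, mae_hi = bootstrap_ci(y_true, y_pred, mae)
r2_lo, r2_hi = bootstrap_ci(y_true, y_pred, r2)

print(f"RMSE 95% CI: {rmse_lo:.4f} to {rmse_hi:.4f}")
print(f"MAE  95% CI: {mae_lo:.4f} to {mae_hi:.4f}")
print(f"R2   95% CI: {r2_lo:.4f} to {r2_hi:.4f}")
```

Resampling pairs rather than refitting the model captures only the sampling variability of the test set, so intervals produced this way say nothing about variability across training runs.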

Conclusion:

This study demonstrates that a transparent multimodal machine-learning workflow can achieve strong benchmark-based performance for anticancer drug-response prediction using TDC GDSC2. The contribution is methodological and benchmark-focused, emphasizing reproducible preprocessing, multimodal integration, model comparison, uncertainty estimation, and post hoc interpretability rather than direct clinical or therapeutic recommendations.
