Optimized and Interpretable Multimodal Machine Learning for Benchmark-based Drug-Response Prediction Using the Therapeutics Data Commons Genomics of Drug Sensitivity in Cancer 2 Dataset
Albina Ondabayeva
ABSTRACT
Background:
Benchmark-based drug-response prediction is essential for reproducible pharmacogenomic machine-learning research. However, many studies insufficiently separate model-development stages, provide limited preprocessing transparency, and overstate biological interpretation from latent features.
Objective:
To develop and evaluate a reproducible multimodal machine-learning workflow for anticancer drug-response prediction using the Therapeutics Data Commons (TDC) Genomics of Drug Sensitivity in Cancer 2 (GDSC2) benchmark dataset.
Materials and Methods:
The TDC GDSC2 dataset was analyzed using the predefined benchmark split. Drug structures were encoded as 1024-bit Morgan fingerprints, while 17,737-dimensional cell-line features were standardized and reduced to 128 principal components, retaining approximately 61.8% of variance. Multimodal features were constructed by concatenating drug and cell-line representations. Ridge regression, ElasticNet, Random Forest, multilayer perceptron, XGBoost, and modality-specific baselines were compared. Hyperparameter optimization for XGBoost was performed using RandomizedSearchCV with three-fold cross-validation on the training set. The final model was retrained on combined training and validation data and evaluated once on the independent test set. Bootstrap resampling was used to estimate 95% confidence intervals. Model behavior was assessed using residual analysis, calibration-style evaluation, global feature importance, and principal component analysis (PCA) loading-based interpretability analysis.
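The bootstrap step described above can be illustrated with a minimal sketch. It assumes a percentile bootstrap over paired test-set observations and predictions, which is one common way to obtain 95% confidence intervals; the abstract does not state which bootstrap variant or how many resamples were used. The function names (`rmse`, `mae`, `bootstrap_ci`), the 1,000 resamples, and the synthetic data are illustrative assumptions, not the study's code or results.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error over paired observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed variant, not confirmed by the source).

    Resamples (truth, prediction) pairs with replacement, recomputes the
    metric on each resample, and returns the alpha/2 and 1 - alpha/2
    empirical quantiles of the resampled metric values.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Synthetic stand-in for held-out test targets and model predictions
# (hypothetical values; NOT the GDSC2 data).
data_rng = random.Random(42)
y_true = [data_rng.gauss(0.0, 2.0) for _ in range(500)]
y_pred = [t + data_rng.gauss(0.0, 1.0) for t in y_true]

point = rmse(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, rmse)
print(f"RMSE = {point:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

Passing `mae` instead of `rmse` as the `metric` argument yields the corresponding interval for the mean absolute error. Resampling pairs, rather than targets and predictions independently, keeps each prediction matched to its own observation.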
Results:
The fusion XGBoost model achieved the strongest validation performance (root mean squared error [RMSE] = 1.0517, mean absolute error [MAE] = 0.7808,
Conclusion:
This study demonstrates that a transparent multimodal machine-learning workflow can achieve strong benchmark-based performance for anticancer drug-response prediction using TDC GDSC2. The contribution is methodological and benchmark-focused, emphasizing reproducible preprocessing, multimodal integration, model comparison, uncertainty estimation, and