Machine learning-based modeling of pharmaceutical sorption in soils: Integrating conformal prediction and Shapley additive explanations analysis for robust risk assessment
Abdul Ghafoor, Muhammad Munir, Muhammad Tahir
Abstract
Pharmaceuticals are emerging soil contaminants introduced through wastewater effluents, biosolids application, and reclaimed-water irrigation. Because sorption governs their mobility and potential transfer to groundwater, robust prediction across soils and compounds is essential, yet it remains difficult for ionizable pharmaceuticals because of pH-dependent speciation and multiple sorption mechanisms. Here, we developed an uncertainty-aware machine-learning framework to predict pharmaceutical sorption in soils under realistic generalization to previously unseen compounds. A pharmaceutical-focused dataset was compiled from a large adsorption database, and models were trained using leave-compound-out grouped cross-validation (K = 5). ExtraTrees (ET), Random Forest (RF), XGBoost, ridge regression, and a compact tabular convolutional neural network were benchmarked using pooled out-of-fold Q², root mean square error (RMSE), and mean absolute error (MAE). Predictive uncertainty was quantified with 90% split conformal prediction intervals. Soils were further grouped into three sorption-capacity archetypes using a sorption capacity index, and Shapley additive explanations analysis and response-surface visualization were used to interpret pH and concentration effects. ET showed the best predictive performance (Q² = 0.524, RMSE = 0.90, MAE = 0.64 log units). Conformal intervals achieved the nominal 90% coverage for all models, with the narrowest intervals for ET and RF. Model interpretation identified experimental conditions (logCe, pH), soil retention proxies (soil organic carbon, cation exchange capacity, clay, soil pH), and pH-dependent speciation fractions as the dominant predictors. Response-surface maps across pH and concentration revealed heterogeneous mobility patterns, with coarse-textured, low-organic-matter soils representing priority monitoring zones.
Overall, tree-ensemble models, particularly ET, provided the most reliable and well-calibrated predictions of pharmaceutical sorption.
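
As an illustration of the two validation ideas named in the abstract, the following minimal Python sketch shows how rows might be assigned to leave-compound-out folds and how a 90% split conformal interval could be derived from calibration residuals. It is a sketch under stated assumptions, not the authors' implementation: the function names, the round-robin assignment of compounds to folds, and the use of absolute residuals as the conformity score are all hypothetical choices made here for clarity.

```python
import math


def leave_compound_out_folds(compounds, k=5):
    """Assign each row to one of k folds so that every row of a given
    compound lands in the same fold (assumed round-robin over sorted names)."""
    unique = sorted(set(compounds))
    fold_of = {name: i % k for i, name in enumerate(unique)}
    return [fold_of[name] for name in compounds]


def conformal_halfwidth(cal_true, cal_pred, alpha=0.10):
    """Split conformal half-width from a held-out calibration set.

    Assumes the conformity score is the absolute residual |y - y_hat|.
    Returns the score at rank ceil((n + 1) * (1 - alpha)), or infinity
    when the calibration set is too small for the requested alpha.
    """
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank
    if rank > n:
        return math.inf
    return scores[rank - 1]


def predict_interval(pred, halfwidth):
    """Symmetric prediction interval around a point prediction."""
    return (pred - halfwidth, pred + halfwidth)


if __name__ == "__main__":
    # Hypothetical compound labels, one per sorption record.
    compounds = ["A", "A", "B", "C", "C", "D", "E", "F"]
    print(leave_compound_out_folds(compounds, k=5))

    # Hypothetical calibration targets and predictions (log units).
    cal_true = [1.2, 0.8, 2.5, 1.9, 3.1, 0.4, 2.2, 1.0, 2.8, 1.5]
    cal_pred = [1.0, 1.1, 2.0, 2.1, 2.6, 0.9, 2.0, 1.4, 2.1, 1.6]
    q = conformal_halfwidth(cal_true, cal_pred, alpha=0.10)
    print(predict_interval(1.7, q))
```

Because whole compounds are held out together, the fold-level error reflects prediction for chemicals the model has never seen, and the conformal half-width is computed only from data not used for fitting, which is what gives the interval its marginal coverage guarantee.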