Explainable and Optimized Gradient Boosting Algorithms for Near‐Real‐Time Prediction of Cyanobacterial Alert Levels in Freshwater Systems
Marcelo A. Cappelletti, María Belén Sathicq, M. Julissa Atía, Joaquín Cochero, Lucas M. Olivera, Jorge R. Osio
ABSTRACT
Harmful cyanobacterial blooms (HABs) pose serious risks to freshwater ecosystems, drinking water supplies, and public health, highlighting the need for reliable early‐warning systems. This study presents a rigorously validated machine learning framework for predicting cyanobacterial alert levels under strongly imbalanced conditions using routinely measured physicochemical variables. Four gradient boosting algorithms were systematically combined with 12 resampling strategies and evaluated within a nested cross‐validation framework to ensure unbiased performance assessment. Model evaluation incorporated metrics tailored to imbalanced classification, including recall, F1‐score, balanced accuracy (BA), and the Matthews correlation coefficient (MCC), with particular emphasis on the detection of alert events. Results demonstrate that resampling is critical for improving minority‐class detection, with SMOTE‐based approaches consistently providing the most favorable balance between sensitivity and precision across algorithms. LightGBM combined with SMOTE achieved the highest recall and F1‐score, together with strong BA and MCC values and low variability across folds, indicating robust generalization. XGBoost combined with SMOTE exhibited a more balanced precision–recall profile with comparable overall performance but higher variability. SHAP‐based interpretability analyses revealed consistent and ecologically meaningful drivers across models, with water temperature, turbidity, and pH emerging as the most influential predictors. By restricting inputs to variables measurable in near real time using low‐cost in situ sensors, the proposed framework is designed to support operationally feasible early‐warning applications through frequent updates of alert‐level predictions within environmental monitoring systems. 
Overall, the findings highlight the importance of addressing class imbalance, validating models rigorously, and incorporating interpretability when building practical cyanobacterial early‐warning applications.
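
To illustrate why the evaluation relies on recall, F1‐score, BA, and MCC rather than plain accuracy, the following minimal sketch computes these metrics from a binary confusion matrix. It assumes a simplified alert versus no‐alert setting; the function name `imbalance_metrics` and all counts are hypothetical and are not taken from the study's data or code.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, precision, F1, BA and MCC from binary counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Recall (sensitivity): fraction of true alert events that were detected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-alert samples correctly left unflagged.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Precision: fraction of predicted alerts that were real alerts.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Balanced accuracy averages the per-class recalls.
    balanced_accuracy = (recall + specificity) / 2
    # MCC uses all four cells; defined as 0 here when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Hypothetical imbalanced test set: 10 alert samples among 100.
# A classifier that never predicts an alert still reaches 90% accuracy.
never_alert = imbalance_metrics(tp=0, fp=0, fn=10, tn=90)

# A classifier that detects 6 of the 10 alerts and raises 4 false alarms.
example = imbalance_metrics(tp=6, fp=4, fn=4, tn=86)

for name, result in [("never_alert", never_alert), ("example", example)]:
    print(name, {k: round(v, 3) for k, v in result.items()})
```

In this hypothetical case the two classifiers differ by only two points of accuracy (0.90 versus 0.92), whereas recall, BA, and MCC separate them clearly: the never‐alert classifier scores 0 recall, 0.5 BA, and 0 MCC.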