Service-Specific Heterogeneity in Sepsis Variable Significance and Machine Learning Model Performance: A Stratified Analysis of the BIAlert Cohort
Marcio Borges-Sa, Eric Macias-Fassio, Alejandro Delgado, Santiago Salas-Sosa, María Aranda, Antonia Socias, Alberto del Castillo, Andres Giglio

Background/Objectives: Sepsis detection relies on clinical variables and scoring systems assumed to perform uniformly across hospital settings. However, sepsis phenotype distributions shift between clinical environments, suggesting that variable importance may be setting dependent. This study aimed to quantify service-specific variability in the discriminatory capacity of clinical variables for sepsis detection and to evaluate whether this heterogeneity translates into differential performance of machine learning models compared to traditional clinical scoring systems.

Methods: This stratified sub-analysis of the BIAlert Sepsis cohort (203,755 patients; 11,864 sepsis episodes, 2014–2018) evaluated 61 structured quantitative variables across nine hospital services (≥90 sepsis episodes each). Within each service, the Mann–Whitney–Wilcoxon test (p < 0.01, Holm-corrected) assessed differences between septic and non-septic episodes. Five machine learning models (Random Forest/BIAlert, XGBoost, CatBoost, SVM, Neural Network) and three clinical rules (NEWS, SIRS, qSOFA) were evaluated globally and stratified across four clinical environments.

Results: The proportion of significant variables ranged from 95.1% in the Emergency Department (58/61) to 37.7% in the Intensive Care Unit (23/61). Lactate was the only universally significant variable (9/9 services). Clinical scoring systems collapsed in Critical Care (qSOFA and NEWS AUC 0.459). BIAlert maintained the highest AUC across all environments (0.975–0.857). The Friedman test confirmed significant differences (χ² = 28.00, p < 0.001), with BIAlert achieving a mean rank of 1.0.

Conclusions: The discriminatory capacity of clinical variables for sepsis detection is not uniform across hospital services. ML models, particularly BIAlert, maintained robust performance where fixed-rule scoring systems failed.
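
The per-service significance screen described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the Holm correction is applied across the tested variables within each service, that episodes arrive as plain dictionaries, and that no values are missing. The names `mann_whitney_p`, `holm_adjust`, `significant_variables`, and `make_synthetic_episodes` are hypothetical, and the data are synthetic, generated only to exercise the functions. The Mann–Whitney p-value uses the normal approximation with tie and continuity corrections.

```python
import math
import random
from itertools import groupby


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        return 1.0
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0  # sum of midranks belonging to sample x
    tie_term = 0.0    # sum of t^3 - t over groups of tied values
    pos = 0
    for _, group in groupby(pooled, key=lambda item: item[0]):
        group = list(group)
        t = len(group)
        midrank = pos + (t + 1) / 2  # average of ranks pos+1 .. pos+t
        rank_sum_x += midrank * sum(1 for _, label in group if label == 0)
        tie_term += t**3 - t
        pos += t
    n = n1 + n2
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0
    # Continuity correction; clamp at zero so the p-value never exceeds 1.
    z = max((abs(u - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    return math.erfc(z / math.sqrt(2))


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def significant_variables(episodes, variables, alpha=0.01):
    """Map each service to the variables that differ between septic and
    non-septic episodes after Holm correction within that service."""
    result = {}
    for service in sorted({e["service"] for e in episodes}):
        rows = [e for e in episodes if e["service"] == service]
        pvals = []
        for var in variables:
            septic = [e[var] for e in rows if e["sepsis"]]
            other = [e[var] for e in rows if not e["sepsis"]]
            pvals.append(mann_whitney_p(septic, other))
        adjusted = holm_adjust(pvals)
        result[service] = [v for v, p in zip(variables, adjusted) if p < alpha]
    return result


def make_synthetic_episodes(seed=0, n_per_group=100):
    """Synthetic data only: septic values are shifted by the given number of SDs."""
    rng = random.Random(seed)
    shifts = {("ED", "lactate"): 2.0, ("ED", "heart_rate"): 2.0,
              ("ICU", "lactate"): 2.0, ("ICU", "heart_rate"): 0.0}
    episodes = []
    for service in ("ED", "ICU"):
        for sepsis in (True, False):
            for _ in range(n_per_group):
                row = {"service": service, "sepsis": sepsis}
                for var in ("lactate", "heart_rate"):
                    shift = shifts[(service, var)] if sepsis else 0.0
                    row[var] = rng.gauss(shift, 1.0)
                episodes.append(row)
    return episodes


if __name__ == "__main__":
    data = make_synthetic_episodes()
    for service, names in significant_variables(data, ["lactate", "heart_rate"]).items():
        print(service, names)
```

Counting the length of each returned list and dividing by the number of tested variables gives the kind of per-service proportion reported in the Results. In practice, library routines such as `scipy.stats.mannwhitneyu` and `statsmodels.stats.multitest.multipletests` with `method="holm"` would normally replace the hand-written functions. They are written out here only to keep the sketch dependency-free.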