Service-Specific Heterogeneity in Sepsis Variable Significance and Machine Learning Model Performance: A Stratified Analysis of the BIAlert Cohort

doi:10.3390/jcm15134904

ISSN: 2077-0383

Marcio Borges-Sa, Eric Macias-Fassio, Alejandro Delgado, Santiago Salas-Sosa, María Aranda, Antonia Socias, Alberto del Castillo, Andres Giglio

Background/Objectives: Sepsis detection relies on clinical variables and scoring systems that are assumed to perform uniformly across hospital settings. However, sepsis phenotype distributions shift between clinical environments, suggesting that variable importance may be setting-dependent. This study aimed to quantify service-specific variability in the discriminatory capacity of clinical variables for sepsis detection and to evaluate whether this heterogeneity translates into differential performance of machine learning models compared to traditional clinical scoring systems.

Methods: This stratified sub-analysis of the BIAlert Sepsis cohort (203,755 patients; 11,864 sepsis episodes, 2014–2018) evaluated 61 structured quantitative variables across nine hospital services (≥90 sepsis episodes each). Within each service, the Mann–Whitney–Wilcoxon test (Holm-corrected, significance at p < 0.01) compared each variable between septic and non-septic episodes. Five machine learning models (Random Forest/BIAlert, XGBoost, CatBoost, SVM, Neural Network) and three clinical rules (NEWS, SIRS, qSOFA) were evaluated globally and stratified across four clinical environments.

Results: The proportion of significant variables ranged from 95.1% in the Emergency Department (58/61) to 37.7% in the Intensive Care Unit (23/61). Lactate was the only universally significant variable (9/9 services). Clinical scoring systems collapsed in Critical Care (qSOFA and NEWS AUC 0.459, below the 0.5 expected from chance). BIAlert had the highest AUC in every environment (range 0.857–0.975). The Friedman test confirmed significant differences among the evaluated methods (χ² = 28.00, p < 0.001), with BIAlert achieving a mean rank of 1.0.

Conclusions: The discriminatory capacity of clinical variables for sepsis detection is not uniform across hospital services. ML models, particularly BIAlert, maintained robust performance where fixed-rule scoring systems failed.
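The two statistical procedures named in the abstract, Holm correction of per-variable p-values and the Friedman rank test across methods, can be illustrated with a minimal sketch. This is not the authors' code. The function names `holm_adjust` and `friedman_statistic`, the p-values, and the AUC grid are hypothetical and invented for illustration. The sketch also assumes that the Friedman test treated the four clinical environments as blocks and the eight methods (five models, three rules) as treatments, with no tied scores, which the abstract does not state explicitly.

```python
from typing import List, Sequence


def holm_adjust(pvalues: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for position, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - position) * pvalues[idx])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def friedman_statistic(scores: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square for scores[block][method].

    Higher score means better (rank 1). Assumes no ties within a block.
    """
    n = len(scores)        # number of blocks (here: environments)
    k = len(scores[0])     # number of methods being compared
    rank_sums = [0.0] * k
    for row in scores:
        best_first = sorted(range(k), key=lambda j: row[j], reverse=True)
        for rank, j in enumerate(best_first, start=1):
            rank_sums[j] += rank
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)


# Hypothetical raw p-values for four variables within one service.
raw_p = [0.001, 0.02, 0.03, 0.5]
adjusted_p = holm_adjust(raw_p)
significant = [p < 0.01 for p in adjusted_p]  # threshold used in the abstract

# Hypothetical AUC grid: 4 environments x 8 methods, same ordering in every row.
base = [0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55]
auc_grid = [[value - shift for value in base] for shift in (0.0, 0.02, 0.04, 0.06)]
chi2 = friedman_statistic(auc_grid)

print(adjusted_p, significant)
print(round(chi2, 2))
```

Under these assumptions, a method ordering that is identical in all four environments yields the largest value the statistic can take, n(k − 1) = 4 × 7 = 28. The reported χ² = 28.00, together with a mean rank of 1.0 for BIAlert, is consistent with that case, although the abstract alone does not confirm the exact test configuration.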
