Artificial Intelligence Models for Mortality and Outcome Prediction in Intensive Care Unit Sepsis: A Systematic Review
Giuseppe Mazza, Giuseppe Neri, Helenia Mastrangelo, Alessandro Russo, Isabella Aquila, Matteo Antonio Sacco, Jessica Ielapi, Corrado Pelaia, Mario Cannataro, Chiara Lupia, Francesca Serapide, Federico Longhini, Vincenzo Bosco, Zaninni Caroleo, Andrea Bruni, Eugenio Garofalo
Background/Objectives: Artificial intelligence (AI), machine-learning (ML), and deep-learning (DL) models are increasingly used for prognostic prediction in intensive care unit (ICU) sepsis, but their clinical readiness remains uncertain. This systematic review aimed to evaluate AI-, ML-, and DL-based models for mortality and clinically relevant outcome prediction in adult ICU patients with sepsis or septic shock.
Methods: PubMed/MEDLINE, Scopus, and the Cochrane Library were searched up to April 2026. Eligible studies included adult ICU sepsis or septic shock cohorts evaluating AI/ML/DL-based prognostic models. Screening, full-text assessment, and data extraction were performed independently by two reviewers. Outcomes, model families, validation strategies, discrimination, calibration, clinical utility, explainability, comparative performance versus conventional severity scores, risk of bias, and reporting completeness were synthesized. Risk of bias was assessed using PROBAST domains supplemented by PROBAST+AI considerations, and reporting completeness was evaluated according to TRIPOD/TRIPOD+AI domains.
Results: Seventy-five studies were included, comprising 50 PubMed-derived and 25 additional Scopus-derived studies. AUROC or C-statistic was extractable in 64 studies, external validation was reported in 27, prospective evaluation in three, calibration in 38, decision-curve analysis or clinical utility assessment in 37, and explainability in 64.
Across 17 directly extractable within-study comparisons from nine studies, AI/ML models usually, but not uniformly, achieved higher discrimination than conventional severity scores, with a median paired ΔAUROC of +0.108 (IQR, +0.082 to +0.148; range, −0.013 to +0.203). Externally validated fixed-horizon models showed clinically relevant but heterogeneous discrimination across sepsis phenotypes, with stronger evidence in selected sepsis-induced coagulopathy cohorts and more variable transportability in respiratory and liver-injury subgroups. However, 45 of the 75 studies were judged to be at high risk of bias, mainly because of limitations in the analysis domain.
Conclusions: AI/ML models for adult ICU sepsis show a recurrent signal of prognostic discrimination and often perform comparably to or better than conventional severity scores in directly extractable within-study comparisons. This signal should be interpreted cautiously given clinical and methodological heterogeneity and the frequent high risk of bias. The strongest evidence comes from externally validated, phenotype-specific models, although routine clinical implementation remains limited by heterogeneous endpoints, incomplete calibration, insufficient prospective validation, and scarce workflow-level evaluation. Future studies should shift from retrospective AUROC optimization toward calibrated, externally validated, clinically actionable, and workflow-integrated decision-support tools tested in prospective ICU settings.
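A paired ΔAUROC summary of the kind reported in the Results (median, IQR, and range of within-study differences between an AI/ML model and a conventional severity score) could be computed along the following lines. This is a minimal sketch, not the review's actual analysis code: the function name `summarize_paired_delta_auroc`, the inclusive quartile method, and the values in `example_pairs` are all assumptions made for illustration and are not data from the included studies.

```python
from statistics import median, quantiles


def summarize_paired_delta_auroc(pairs):
    """Summarize within-study AUROC differences (model minus severity score).

    `pairs` is a list of (auroc_model, auroc_score) tuples, one tuple per
    directly extractable within-study comparison.
    """
    if len(pairs) < 2:
        raise ValueError("at least two paired comparisons are needed for quartiles")

    # Paired difference: positive values favor the AI/ML model.
    deltas = sorted(model - score for model, score in pairs)

    # Quartiles by linear interpolation over the observed values
    # (assumption: the review does not state which quantile method it used).
    q1, _, q3 = quantiles(deltas, n=4, method="inclusive")

    return {
        "n": len(deltas),
        "median": median(deltas),
        "iqr": (q1, q3),
        "range": (deltas[0], deltas[-1]),
    }


# Hypothetical AUROC pairs, invented for illustration only.
example_pairs = [
    (0.85, 0.75),
    (0.80, 0.72),
    (0.78, 0.79),  # a comparison in which the severity score does slightly better
    (0.90, 0.70),
    (0.82, 0.70),
]

summary = summarize_paired_delta_auroc(example_pairs)
print(
    f"n={summary['n']}, median dAUROC={summary['median']:+.3f}, "
    f"IQR {summary['iqr'][0]:+.3f} to {summary['iqr'][1]:+.3f}, "
    f"range {summary['range'][0]:+.3f} to {summary['range'][1]:+.3f}"
)
```

Because several comparisons can come from the same study (17 comparisons from nine studies in this review), a summary of this kind treats comparisons as independent and is best read as descriptive rather than as a pooled effect estimate.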