DOI: 10.1111/anae.70237 ISSN: 0003-2409

Machine learning vs. traditional methods for predicting postoperative cardiac complications after non‐cardiac surgery: a systematic review and Bayesian network meta‐analysis

Saavan Dhaliwal, Shichao Chen, Chris Papas, Ian Hughes, David Cavalucci, Nicholas O'Rourke

Summary

Introduction

Accurate prediction of peri‐operative cardiac complications is critical to optimise pre‐operative decision‐making. Traditional risk prediction scores, such as the Revised Cardiac Risk Index, show only modest discrimination. Machine learning models can capture complex, non‐linear relationships, but their predictive performance compared with traditional scores remains unclear.

Methods

We performed a systematic review and Bayesian network meta‐analysis. The primary outcome was postoperative adverse cardiac events following non‐cardiac surgery. Prediction models were assessed relative to the Revised Cardiac Risk Index. As many studies evaluated multiple versions of each model type, the highest performing (‘best version’) and lowest performing (‘worst version’) results were analysed. Models were ranked using the surface under the cumulative ranking curve (SUCRA).
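
For readers unfamiliar with SUCRA, the following is a minimal sketch of how the statistic is conventionally computed from a model's posterior rank probabilities. It is not the authors' analysis code: the function name `sucra_percent` and the input format are assumptions made for illustration, and the rank probabilities would in practice come from the fitted Bayesian network meta‐analysis.

```python
from itertools import accumulate


def sucra_percent(rank_probs):
    """Return SUCRA (0-100) for one model.

    rank_probs[j] is the posterior probability that the model holds
    rank j + 1 among all compared models (rank 1 = best).
    Hypothetical helper for illustration only.
    """
    n_models = len(rank_probs)
    if n_models < 2:
        raise ValueError("SUCRA needs at least two competing models")
    if abs(sum(rank_probs) - 1.0) > 1e-9:
        raise ValueError("rank probabilities must sum to 1")

    # Cumulative probability of being ranked j-th or better.
    cumulative = list(accumulate(rank_probs))

    # Average the cumulative probabilities over the first n - 1 ranks;
    # the final one is always 1 and carries no information.
    return 100.0 * sum(cumulative[:-1]) / (n_models - 1)
```

Under this definition a model that is certain to rank first scores 100, a model that is certain to rank last scores 0, and a model equally likely to hold any rank scores 50.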

Results

Thirteen studies evaluating 54 models and 927,113 patients were included. Machine learning approaches generally outperformed traditional risk scores. Automated machine learning ranked highest (SUCRA 96.6), showed the greatest improvement in the best version analysis (mean difference (MD) 0.28 (95%CrI 0.16–0.40)) and remained superior in the sensitivity analysis (MD 0.30 (95%CrI 0.14–0.45)). Gradient boosting models showed superior performance over the Revised Cardiac Risk Index across analyses (best version: MD 0.20 (95%CrI 0.14–0.26), worst version: MD 0.18 (95%CrI 0.12–0.25), SUCRA 82.4). The Gupta Perioperative Risk for Myocardial Infarction or Cardiac Arrest score outperformed the Revised Cardiac Risk Index in the best version analysis (MD 0.16 (95%CrI 0.01–0.32)). Between‐study heterogeneity was low. None of the included studies externally validated their machine learning models and only six were judged to be at low risk of bias.

Discussion

Most machine learning models showed better discrimination than traditional risk scores, with automated machine learning and gradient boosting models ranking highest. However, study quality, calibration reporting and absence of external validation limit immediate clinical adoption. Prospective, multicentre evaluation is required before integration of these models into peri‐operative practice.
