Evaluating ChatGPT’s accuracy in predicting postoperative nausea and vomiting risk and antiemetic prophylaxis planning: A study on simulated patient profiles

doi:10.4103/sja.sja_212_26

ISSN: 1658-354X

Hubba Ahmed, Tooba Usman, Alina Mahmood, Umama Masnoon, Maria Hashmi, Muhammad M. Ali

ABSTRACT

Background:

Postoperative nausea and vomiting (PONV) is a distressing condition following general anesthesia. The Apfel Simplified Risk Score (SRS) is used to estimate PONV risk and guide antiemetic prophylaxis as per the Society for Ambulatory Anesthesia (SAMBA) guidelines. The use of generative AI models for calculating this risk and guiding antiemetic prophylaxis remains unexplored.
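For readers unfamiliar with the score, the following is a minimal sketch of how the Apfel simplified score is commonly computed: one point for each of four risk factors (female sex, non-smoking status, history of PONV or motion sickness, and postoperative opioid use). The class and field names are hypothetical, and the risk percentages are the commonly cited rounded figures from the original Apfel publication, not values reported in this study.

```python
from dataclasses import dataclass


@dataclass
class PatientProfile:
    """Hypothetical container for the four Apfel risk factors."""
    female: bool
    non_smoker: bool
    history_ponv_or_motion_sickness: bool
    postoperative_opioids: bool


# Commonly cited approximate PONV risk (%) per score; an assumption
# taken from the general literature, not from this study.
APPROX_RISK_PERCENT = {0: 10, 1: 20, 2: 40, 3: 60, 4: 80}


def apfel_score(profile: PatientProfile) -> int:
    """Return the number of risk factors present (0 to 4)."""
    return sum([
        profile.female,
        profile.non_smoker,
        profile.history_ponv_or_motion_sickness,
        profile.postoperative_opioids,
    ])


# Illustrative profile (invented for this example).
example = PatientProfile(
    female=True,
    non_smoker=True,
    history_ponv_or_motion_sickness=False,
    postoperative_opioids=True,
)
score = apfel_score(example)
print(f"Apfel score: {score}, approximate risk: {APPROX_RISK_PERCENT[score]}%")
```

The mapping from score to a prophylaxis regimen is deliberately left out, since it depends on the specific SAMBA guideline edition being followed.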

Objective:

The study aims to investigate the accuracy of the unpaid version of ChatGPT 4.0 in determining the Apfel score and its adherence to the SAMBA antiemetic prophylaxis guideline recommendations using simulated patient profiles.

Methodology:

Our study was conducted in the Department of Anesthesiology on 100 simulated patient profiles and was completed in 2 months after approval by the Institutional Review Board of Dow University of Health Sciences. A pilot study was conducted to determine the sample size. The simulated profiles were generated by the researchers, and ChatGPT was then asked to calculate their Apfel scores. Experienced anesthesiologists, blinded to ChatGPT’s responses, calculated the same variables for the profiles. Data were collected and analyzed using SPSS.

Results:

Among 100 simulated patient profiles, ChatGPT scored 99 correctly, showing near-perfect agreement with anesthesiologists (Cohen’s κ = 0.975, P < 0.001), with 98% concordance in risk stratification (κ = 0.953, P < 0.001). However, ChatGPT’s adherence to SAMBA antiemetic prophylaxis guidelines was low, with correct recommendations in only 31% of simulated profiles (κ = 0.141, P < 0.001).
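As a reminder of what the agreement statistic measures, Cohen’s κ compares the observed agreement between two raters with the agreement expected by chance from each rater’s category frequencies. The sketch below is a plain-Python illustration of that formula; the study itself used SPSS, and the function name and the sample ratings are hypothetical, not study data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items rated identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over categories of the product of each
    # rater's marginal proportion for that category.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; kappa is
        # undefined, so agreement is reported as perfect by convention here.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented ratings for illustration only (e.g. Apfel scores 0 to 4).
model_scores = [0, 1, 2, 3, 4, 2, 1, 3]
clinician_scores = [0, 1, 2, 3, 4, 2, 1, 2]
print(round(cohens_kappa(model_scores, clinician_scores), 3))
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance, which is why a κ of 0.141 for prophylaxis recommendations is read as poor agreement.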

Conclusion:

ChatGPT can reliably determine the Apfel score of patients and accurately classify them into risk categories. However, its antiemetic recommendations are inconsistent with the SAMBA guidelines, necessitating human oversight for guideline-based management. Such large language models can be used as adjuncts in perioperative assessments but cannot replace human judgment.
