DOI: 10.1093/bjs/znad258.155

900 Assessing the Use of GIRFT Guidelines and the Role of a Natural Language Processing Artificial Intelligence System for Operation Note Documentation in Emergency General Surgery – a Closed Loop Prospective Audit & Comparative Study

B Karki, L Rossi, A Peckham Cooper, J Burke

Abstract

Aim

4.6 million NHS operation notes (ONs) are generated annually. The Getting It Right First Time (GIRFT) initiative created procedure-specific guidelines for ON documentation. Natural Language Processing (NLP) Artificial Intelligence systems have the potential to improve documentation quality. This audit aimed to assess whether GIRFT guidelines improve ON documentation quality (P1) and to compare ON documentation produced by a novel NLP programme with that produced by surgeons (P2).

Method

P1: The five most recent consecutive ONs for each of four procedures on a tertiary-centre CEPOD list, Laparoscopic Appendicectomy (LA), Laparoscopic Cholecystectomy (LC), Open Inguinal Hernia (IH) and Emergency Laparotomy and Small Bowel Resection (ELS), were collected, anonymised, and assessed against GIRFT guidelines by two independent reviewers. All surgeons were then educated on the four guidelines (intervention), and the audit loop was closed with a second cycle of assessment. P2: NLP-generated ONs were produced with ChatGPT (OpenAI, USA; version 15/12/22) using five iterative response generations.
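As an illustration of how a note might be scored against a guideline checklist, the sketch below treats each guideline as a list of marking points and reports the percentage documented. This is a minimal sketch under stated assumptions: the abstract does not give the scoring formula, the checklist items shown are hypothetical placeholders rather than GIRFT wording, and the names `MarkedNote` and `documentation_score` are introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class MarkedNote:
    """One anonymised operation note marked against a guideline checklist."""
    procedure: str               # e.g. "LA", "LC", "IH" or "ELS"
    items_met: dict[str, bool]   # marking point -> was it documented?


def documentation_score(note: MarkedNote) -> float:
    """Return the percentage of marking points documented in the note.

    Assumption: every marking point carries equal weight.
    """
    if not note.items_met:
        raise ValueError("checklist is empty")
    met = sum(note.items_met.values())
    return 100.0 * met / len(note.items_met)


# Hypothetical marking points, for illustration only (not the GIRFT wording).
example = MarkedNote(
    procedure="LA",
    items_met={
        "indication": True,
        "incision": True,
        "findings": True,
        "closure": False,
    },
)

score = documentation_score(example)
print(f"{example.procedure}: {score:.1f}%")
```

How the two reviewers' marks were reconciled is not stated in the abstract, so the sketch scores a single set of marks per note.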

Results

P1: 40 ONs were assessed across 2200 marking points. Post-intervention, median documentation scores improved by +12.3% [p = 0.033] (LA), +0.6% [p = 0.500] (LC), +25.5% [p = 0.006] (IH) and +9.5% [p = 0.030] (ELS). The most accurate documentation was LA in cycle 2, median score 72.7% [IQR = 8.5].
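For orientation, 40 ONs assessed across 2200 marking points is an average of 55 marking points per note. The sketch below shows one way the per-procedure summary statistics could be computed from per-note percentage scores. It rests on assumptions: the scores are invented for illustration and are not the audit's data, the change is taken as the difference between cycle medians in percentage points, and no significance test is shown because the abstract does not name the test used.

```python
import statistics


def median_and_iqr(scores):
    """Return (median, interquartile range) for a list of percentage scores."""
    # method="inclusive" treats the data as the full population of notes.
    q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return q2, q3 - q1


# Illustrative scores for five notes per cycle (not the audit's data).
cycle1 = [50.0, 55.0, 60.0, 65.0, 70.0]
cycle2 = [60.0, 70.0, 75.0, 80.0, 85.0]

median1, iqr1 = median_and_iqr(cycle1)
median2, iqr2 = median_and_iqr(cycle2)

# Assumption: change is the difference of medians, in percentage points.
change = median2 - median1

print(f"cycle 1 median {median1:.1f}% [IQR = {iqr1:.1f}]")
print(f"cycle 2 median {median2:.1f}% [IQR = {iqr2:.1f}]")
print(f"change {change:+.1f} percentage points")
```

With only five notes per procedure per cycle, quartile estimates depend on the interpolation method chosen, so a different method could give a slightly different IQR.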

P2: ChatGPT performance improved by a median of +29.7% [IQR = 9.4] between response generations 1 and 5. ChatGPT scores were inferior to cycle 2 surgeon scores: -37.4% [p = 0.006] (LA), -47.7% [p = 0.002] (LC), -39.6% [p = 0.002] (IH) and -32.3% [p = 0.006] (ELS).

Conclusions

Using GIRFT documentation guidelines can improve the quality of ON documentation. ChatGPT performed significantly worse than surgeons across four emergency general surgery (EGS) procedures. However, average NLP scores improved by approximately one third by the fifth response generation, indicating promise as a future adjunct.
