DOI: 10.1161/circ.148.suppl_1.16401 ISSN: 0009-7322

Abstract 16401: Optimizing ChatGPT to Detect VT Recurrence From Complex Medical Notes

Ruibin Feng, Kelly A Brennan, Zahra Azizi, Jatin Goyal, Maxime Pedron, Hui Ju Chang, Prasanth Ganesan, Samuel Ruiperez-Campillo, Brototo Deb, Paul L Clopton, Tina Baykaner, Albert J Rogers, Sanjiv M Narayan

Introduction: Large language models (LLMs), such as ChatGPT, have a remarkable ability to interpret natural language using text questions (prompts), drawing on web-scale training data. However, ChatGPT performs less impressively when addressing nuanced questions from finite repositories of lengthy, unstructured clinical notes (Fig A).

Hypothesis: ChatGPT's ability to identify sustained ventricular tachycardia (VT) or ventricular fibrillation (VF) after ablation from free-text medical notes is improved by optimizing the question and adding in-context sample notes with correct responses (‘prompt engineering’).

Methods: We curated a dataset of N = 125 patients with implantable defibrillators (32.0% female, LVEF 48.9±13.9%, age 61.7±14.0 years), split into development (N = 75) and testing (N = 50) sets of 307 and 337 notes, with 256.8±95.1 and 289.8±103 words, respectively. Notes were deidentified. Gold standard labels for recurrent VT (Yes, No, Unknown) were provided by experts. We applied GPT-3.5 to the test set (N = 337 notes), using 1 of 3 prompts (“Does the patient have sustained VT or VF after ablation?” or 2 others), systematically adding 1-5 in-context “training” examples, and repeating experiments 10 times (51,561 inquiries).
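The in-context ("few-shot") setup described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name, message structure, and example placeholders are assumptions; only the prompt text is quoted from the abstract.

```python
# Sketch of few-shot prompt assembly for a chat-style LLM: the question is
# paired with k labeled development-set notes before the note to classify.
def build_fewshot_messages(question, examples, note):
    """Return a chat message list: instructions, then (note, correct label)
    pairs as user/assistant turns, then the test note as a final user turn."""
    messages = [{"role": "system",
                 "content": "Answer with one of: Yes, No, Unknown."}]
    for ex_note, ex_label in examples:  # 1-5 in-context examples
        messages.append({"role": "user",
                         "content": f"{question}\n\nNote:\n{ex_note}"})
        messages.append({"role": "assistant", "content": ex_label})
    messages.append({"role": "user",
                     "content": f"{question}\n\nNote:\n{note}"})
    return messages

question = "Does the patient have sustained VT or VF after ablation?"
examples = [("...deidentified development-set note...", "Yes"),
            ("...another development-set note...", "No")]
msgs = build_fewshot_messages(question, examples, "...test-set note...")
# msgs would then be sent to a GPT-3.5 chat-completions endpoint.
```

Casting the examples as prior user/assistant turns lets the model imitate the demonstrated answer format on the final note without any fine-tuning.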

Results: At baseline, GPT-3.5 achieved an F1 score of 38.6%±19.4% (mean across 3 prompts; Fig B). Increasing the number of in-context examples progressively improved mean accuracy and reduced variance. The best result combined the illustrated prompt with 5 in-context examples, yielding an F1 score of 84.6%±6.4% (p<0.05).
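For reference, the F1 metric reported above can be computed as below. The abstract does not state the averaging scheme for the three labels (Yes, No, Unknown); this sketch assumes per-class F1 averaged across labels (macro-F1), which is a common choice for multi-class evaluation.

```python
# Macro-F1 over three-way labels: per-class precision/recall/F1, averaged.
def macro_f1(gold, pred, labels=("Yes", "No", "Unknown")):
    scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy illustration (not study data): one "Unknown" note misclassified as "Yes".
gold = ["Yes", "No", "Unknown", "Yes"]
pred = ["Yes", "No", "Yes", "Yes"]
```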

Conclusions: ChatGPT can accurately identify VT recurrence from small numbers of complex medical notes with optimal prompt engineering. Future studies should define optimal context for different medical questions and domains. These findings pave the way for automated analysis of large medical repositories to broadly improve decision making.
