DOI: 10.65092/autfm.1878887 ISSN: 0365-8104

Diagnostic Accuracy of ChatGPT in Shoulder Fractures: A Comparative Study with an Emergency Medicine Specialist and a Radiologist

Osman Konukoğlu, Murat Kaya, Şeyma Eseoğlu Pekcan, İsa Günaydin
Background: Chat Generative Pre-Trained Transformer (ChatGPT) is a new large language model capable of simulating real-life conversations and providing diagnostic support. However, its performance in detecting fractures in shoulder trauma has not been investigated.

Objective: To evaluate the diagnostic accuracy of ChatGPT in detecting shoulder fractures and to compare its performance with that of an emergency medicine specialist and a radiologist.

Methods: This retrospective study included 197 patients who underwent both shoulder radiography and computed tomography (CT) between September 2023 and July 2025. One emergency medicine specialist, one radiologist, and two ChatGPT models (4o and 4.5) independently and blindly evaluated anonymized radiographs to determine the presence and location of fractures. CT was accepted as the gold standard. Diagnostic performance was assessed using sensitivity, specificity, accuracy, and area under the curve (AUC) values.

Results: Among 197 patients, 74 (37.56%) had fractures, most commonly involving the humerus. The radiologist demonstrated the highest diagnostic performance (AUC: 0.903; accuracy: 90.86%), followed by the emergency physician (AUC: 0.824; accuracy: 82.74%). ChatGPT-4o (AUC: 0.641; accuracy: 61.93%) and ChatGPT-4.5 (AUC: 0.626; accuracy: 57.36%) showed statistically significant but weak-to-moderate diagnostic contribution, with high sensitivity but poor specificity. Both models demonstrated stable intra-rater agreement.

Conclusion: ChatGPT models showed limited performance compared with clinicians in diagnosing shoulder fractures. While they cannot replace clinical expertise, they may serve as supportive tools in decision-making.
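For readers less familiar with the performance measures named in the Methods, the following is a minimal Python sketch of how sensitivity, specificity, and accuracy are derived from a 2x2 table of reader calls against the CT reference. The function name `diagnostic_metrics` and the counts in the example are hypothetical and are not taken from the study. The last value assumes a single yes/no rating per radiograph, in which case the trapezoidal AUC reduces to the mean of sensitivity and specificity; the abstract does not state how the authors computed AUC.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Compute standard diagnostic measures from 2x2 confusion-matrix counts.

    tp: fractures on CT that the reader called fracture
    fn: fractures on CT that the reader missed
    tn: no fracture on CT, reader called no fracture
    fp: no fracture on CT, reader called fracture
    """
    sensitivity = tp / (tp + fn)          # share of true fractures detected
    specificity = tn / (tn + fp)          # share of non-fractures correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Assumption: one binary call per case, so the ROC curve has a single
    # operating point and the trapezoidal AUC is the mean of the two rates.
    auc_single_point = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "auc_single_point": auc_single_point,
    }


# Hypothetical counts for illustration only; NOT data from the study.
example = diagnostic_metrics(tp=40, fn=10, tn=30, fp=20)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the reader detects most fractures but also flags many normal radiographs, which illustrates how high sensitivity combined with poor specificity can still yield only modest accuracy and AUC.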
