DOI: 10.31832/smj.1855415 ISSN: 2146-409X

The Effectiveness of ChatGPT-4o in the Evaluation of Traumatic Brain CT Imaging

Ahmet Öztürk, Gurbet Yanarateş, Serkan Günay, Erdal Komut, Seval Komut, Yavuz Yiğit
Objective: Large language models (LLMs) such as GPT-4o have recently introduced real-time multimodal capabilities, including medical image interpretation. No prior studies have systematically assessed GPT-4o’s diagnostic accuracy for brain CT scans in trauma patients. We aimed to evaluate GPT-4o’s performance in identifying intracranial pathologies on brain CT images in this setting.

Methods: This retrospective cross-sectional study included adult patients presenting with head trauma between January and June 2024. For each patient, four representative CT slices were selected by a board-certified radiologist, whose interpretations served as the reference standard. The selected images were analysed by GPT-4o. Model outputs were compared with the radiologist’s assessments, and diagnostic accuracy metrics were calculated.

Results: A total of 54 patients were included. Observed pathologies comprised epidural hematoma (22.2%), subdural hematoma (44.4%), subarachnoid hemorrhage (57.4%), parenchymal hemorrhage/contusion (29.6%), pneumocephalus (13.0%), and intraventricular hemorrhage (3.7%). GPT-4o correctly identified all present pathologies in 7.4% of cases and at least one pathology in 24.1%. In 68.5% of cases, no pathology was correctly detected. Sensitivity was low across all categories: epidural hematoma 8.3% (AUC 0.506), subdural hematoma 25.0% (AUC 0.508), subarachnoid hemorrhage 3.2% (AUC 0.451), parenchymal hemorrhage/contusion 50.0% (AUC 0.592), and pneumocephalus 14.3% (AUC 0.571). GPT-4o also generated false positives across multiple pathology categories.

Conclusions: GPT-4o demonstrated insufficient diagnostic accuracy for the independent interpretation of brain CT scans in trauma patients, with low sensitivity and frequent misclassification. These findings underscore the necessity for task-specific training and rigorous validation in larger, multicentre studies prior to clinical implementation.
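
As a rough illustration of how per-pathology metrics of the kind reported above can be derived, the following Python sketch computes sensitivity, specificity, and the AUC of a single yes/no call from a 2x2 table against the radiologist's reference reading. It is not the authors' analysis code. The names `ConfusionCounts`, `sensitivity`, `specificity`, and `binary_auc` are hypothetical. The counts `tp=1` and `fn=11` are back-calculated from the rounded percentages for epidural hematoma (22.2% of 54 patients, 8.3% sensitivity), and `fp` and `tn` are assumed values, since the abstract does not report false-positive counts per category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 counts for one pathology, with the radiologist as reference."""
    tp: int  # model reported it, radiologist reported it
    fn: int  # radiologist reported it, model missed it
    fp: int  # model reported it, radiologist did not
    tn: int  # neither reported it


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of reference-positive cases the model detected."""
    positives = c.tp + c.fn
    if positives == 0:
        raise ValueError("no reference-positive cases")
    return c.tp / positives


def specificity(c: ConfusionCounts) -> float:
    """Fraction of reference-negative cases the model left unflagged."""
    negatives = c.tn + c.fp
    if negatives == 0:
        raise ValueError("no reference-negative cases")
    return c.tn / negatives


def binary_auc(c: ConfusionCounts) -> float:
    """AUC for one binary call per patient.

    With a single operating point, the ROC curve is two line segments,
    so its area equals the mean of sensitivity and specificity.
    """
    return (sensitivity(c) + specificity(c)) / 2


# Illustrative counts only: tp and fn are inferred from rounded
# percentages in the abstract; fp and tn are ASSUMED, not reported.
example = ConfusionCounts(tp=1, fn=11, fp=3, tn=39)

print(f"sensitivity = {sensitivity(example):.3f}")
print(f"specificity = {specificity(example):.3f}")
print(f"AUC         = {binary_auc(example):.3f}")
```

Under this formulation, an AUC close to 0.5 means the model's calls carry little more information than chance for that pathology, even when specificity alone looks high.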
