Large language models in radiology exams: A cross-sectional comparative analysis of performance in Turkish and English

DOI: 10.1097/md.0000000000049517 ISSN: 0025-7974

Şahinde Atlanoğlu, Mehmet Ali Gedik

This study evaluated the performance of large language models on radiology questions, examining the effect of question language, temporal consistency, and performance relative to residents. We evaluated ChatGPT-5, Grok-4, Claude 4.5 Sonnet, and Gemini 2.5 Pro using 100 multiple-choice questions across 5 subspecialties. Linguistic impact (Turkish vs English) and 1-week temporal reliability were assessed. Performance was benchmarked against a control group of 18 radiology residents (years 1–3). Gemini 2.5 Pro achieved the highest accuracy (90%), followed by Claude 4.5 Sonnet (86%). All models and 3rd-year residents significantly outperformed junior residents. No significant performance gap existed between languages (P = 1.000). Claude 4.5 Sonnet demonstrated superior temporal reliability (κ = 0.872) compared with the moderate consistency of Grok-4 and ChatGPT-5. High-performance large language models provide accurate radiology knowledge comparable to that of senior residents, showing significant potential for education. Future research must incorporate image-based datasets to determine clinical efficacy.
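The temporal reliability figure above is reported as a κ statistic. The abstract does not say how it was computed, so the following is only a minimal sketch, assuming Cohen's kappa applied to the answer option a model chose for each question in two sessions one week apart. The function name `cohen_kappa` and the two answer lists are hypothetical illustrations, not the study's code or data.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(first)
    # Observed agreement: share of questions answered identically both times.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the marginal frequencies of each label.
    c1, c2 = Counter(first), Counter(second)
    expected = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if expected == 1.0:
        # Both sessions used a single identical label; agreement is perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical answers to ten questions (NOT data from the study).
week1 = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
week2 = ["A", "B", "C", "D", "A", "B", "C", "A", "A", "C"]
kappa = cohen_kappa(week1, week2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why a model can repeat most of its answers and still score below 1. Under the commonly used Landis and Koch bands, values of 0.41 to 0.60 are read as moderate and values above 0.80 as almost perfect agreement.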
