Large language models in radiology exams: A cross-sectional comparative analysis of performance in Turkish and English
Şahinde Atlanoğlu, Mehmet Ali Gedik
This study evaluated the performance of large language models on radiology exam questions, examining the effect of question language, temporal consistency, and performance relative to radiology residents. We tested ChatGPT-5, Grok-4, Claude 4.5 Sonnet, and Gemini 2.5 Pro on 100 multiple-choice questions spanning 5 subspecialties. The effect of language (Turkish vs English) and 1-week temporal reliability were assessed. Performance was benchmarked against a control group of 18 radiology residents (years 1–3). Gemini 2.5 Pro achieved the highest accuracy (90%), followed by Claude 4.5 Sonnet (86%). All models and 3rd-year residents significantly outperformed junior residents. While no significant performance gap existed between languages (