Comparative evaluation of radiological anatomy knowledge and accuracy of ChatGPT‐5, Gemini 2.5, and Grok 4 across normal and thinking modes


DOI: 10.1002/ase.70292 ISSN: 1935-9772


Ismail Sivri, Furkan Mehmet Ozden, Halit Celik, Ozgur Gokturk, Tuncay Colak

Abstract

This study compared the performance of three large language models, ChatGPT‐5 Plus, Gemini 2.5 Pro, and SuperGrok 4, in identifying anatomical structures on radiographic images using standardized anatomical terminology. Thirty radiographs from different body regions were selected from an open‐access atlas and analyzed by the models in Normal and Thinking modes using standardized prompts based on Terminologia Anatomica (version 2.07). Responses were evaluated independently by two anatomists using a 0–2 scoring system. Data were analyzed using Friedman and Wilcoxon signed‐rank tests, and temporal response consistency was assessed with weighted kappa coefficients. Overall accuracy across both modes and models ranged from 47.4% to 85.7%. Gemini 2.5 Pro and ChatGPT‐5 Plus significantly outperformed SuperGrok 4 in both modes. In Normal mode, Gemini 2.5 Pro achieved the highest overall accuracy (82.7%), significantly exceeding ChatGPT‐5 Plus (60.7%, p = 0.001) and SuperGrok 4 (47.4%, p < 0.001). In Thinking mode, accuracies were 85.7% for Gemini 2.5 Pro, 77.6% for ChatGPT‐5 Plus, and 49.5% for SuperGrok 4. Gemini 2.5 Pro demonstrated a significant advantage over ChatGPT‐5 Plus only in Normal mode (p = 0.001), whereas Thinking mode significantly improved performance only for ChatGPT‐5 Plus (p = 0.01). Temporal stability analysis showed high response consistency for Gemini 2.5 Pro and SuperGrok 4 across all modes (r > 0.94, p < 0.001). Conversely, the stability of ChatGPT‐5 Plus decreased from substantial agreement in Normal mode (r = 0.697, p < 0.001) to moderate agreement in Thinking mode (r = 0.539, p < 0.001). Despite their educational potential, these models need refinement to reliably identify anatomical structures on radiographic images.
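To make the scoring and consistency measures concrete, the following is a minimal sketch and not the authors' analysis code. It assumes that percentage accuracy is the share of the maximum attainable points on the 0–2 scale, and that the weighted kappa uses linear weights; the abstract specifies neither detail. The function names and the example scores are hypothetical and are not study data.

```python
from collections import Counter


def percent_accuracy(scores, max_score=2):
    """Points earned as a percentage of the maximum attainable points (assumed definition)."""
    return 100.0 * sum(scores) / (max_score * len(scores))


def weighted_kappa(first, second, categories=(0, 1, 2)):
    """Cohen's kappa with linear weights for two ratings of the same items.

    Linear weighting is an assumption; the abstract does not state the scheme.
    """
    n = len(first)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    observed = Counter(zip(first, second))  # joint counts of rating pairs
    row = Counter(first)                    # marginal counts, first run
    col = Counter(second)                   # marginal counts, second run

    disagree_obs = 0.0
    disagree_exp = 0.0
    for a in categories:
        for b in categories:
            # Disagreement weight grows linearly with the distance between scores.
            w = abs(index[a] - index[b]) / (k - 1)
            disagree_obs += w * observed[(a, b)] / n
            disagree_exp += w * (row[a] / n) * (col[b] / n)

    if disagree_exp == 0:
        # Every rating falls in one category, so kappa is undefined.
        return float("nan")
    return 1.0 - disagree_obs / disagree_exp


# Hypothetical scores for the same items from two runs of one model.
run_1 = [2, 2, 1, 0, 2, 1]
run_2 = [2, 1, 1, 0, 2, 2]
accuracy_run_1 = percent_accuracy(run_1)
kappa = weighted_kappa(run_1, run_2)
```

A kappa near 1 indicates that repeated runs gave nearly the same scores, while a value near 0 indicates agreement no better than chance.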
