DOI: 10.3390/diagnostics16132064 ISSN: 2075-4418

Comparative Evaluation of Large Language Models for Reporting Jaw Lesions on Panoramic Radiographs

Duygu Çelik Özen, Okan Özen, Utku Tuğberk Göktürk, Hamzahan Solak, Şuayip Burak Duman

Background/Objectives: The aim of this study was to assess the diagnostic capabilities of three large language model-based artificial intelligence chatbots (ChatGPT 4.0, Gemini 2.5, and Microsoft Copilot) in the radiographic evaluation of jaw lesions of different radiographic densities (mixed, radiolucent, and radiopaque) on panoramic images. Methods: A total of 120 panoramic radiographs showing jaw lesions with varying radiographic appearances were independently analyzed using the three artificial intelligence chatbot systems. Each model was provided with the same single-round prompt and a standardized diagnostic scoring framework encompassing lesion structure, configuration, border characteristics, morphology, relationship with teeth, effects on adjacent structures, biological behavior indicators, and total diagnostic scores. Descriptive statistics were reported as mean ± standard deviation and median (minimum–maximum). Differences between LLM scores were analyzed using the Kruskal–Wallis test, followed by Bonferroni-corrected post hoc comparisons. The statistical significance level was set at p < 0.05. Results: Significant differences were observed among the LLMs across multiple diagnostic categories, including lesion structure, configuration, border characteristics, and total scores (p < 0.05). Gemini achieved the highest total scores for radiolucent (11.49 ± 4.97) and mixed lesions (9.01 ± 5.78), whereas ChatGPT scored slightly higher for radiopaque lesions (10.93 ± 2.88). Copilot demonstrated the lowest overall performance across all lesion categories. Conclusions: Large language model-based artificial intelligence chatbots showed variable performance in the panoramic radiographic evaluation of jaw lesions with radiolucent, radiopaque, and mixed patterns, suggesting potential utility as supportive tools. However, further validation studies are required before routine clinical implementation.
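
The statistical comparison described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the abstract does not name the post hoc procedure, so Dunn's test with Bonferroni adjustment is assumed here, and the function names (`average_ranks`, `kruskal_dunn`) and the example scores are hypothetical and are not data from the study. The sketch handles exactly three groups, matching the three chatbots compared, because the chi-square p-value then has a simple closed form.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_dunn(groups):
    """Kruskal-Wallis H test for three groups, then Dunn post hoc tests.

    groups: dict mapping a label to a list of scores.
    Returns (H, overall p, {(label_i, label_j): Bonferroni-adjusted p}).
    """
    labels = list(groups)
    if len(labels) != 3:
        raise ValueError("this sketch supports exactly three groups (df = 2)")

    pooled = [x for g in labels for x in groups[g]]
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Per-group mean rank and the sum of R_i^2 / n_i used by the H statistic.
    mean_rank, size, start, rank_term = {}, {}, 0, 0.0
    for g in labels:
        k = len(groups[g])
        r = ranks[start:start + k]
        start += k
        size[g] = k
        mean_rank[g] = sum(r) / k
        rank_term += sum(r) ** 2 / k

    # Tie correction: sum of (t^3 - t) over each set of tied values.
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    tie_sum = sum(t ** 3 - t for t in counts.values())
    tie_factor = 1 - tie_sum / (n ** 3 - n)
    if tie_factor == 0:
        raise ValueError("all scores are identical; the test is undefined")

    h = (12 / (n * (n + 1)) * rank_term - 3 * (n + 1)) / tie_factor
    p_overall = math.exp(-h / 2)  # chi-square survival function for df = 2

    # Dunn's z statistic per pair, two-sided p, Bonferroni-adjusted.
    pairs = list(combinations(labels, 2))
    base_var = n * (n + 1) / 12 - tie_sum / (12 * (n - 1))
    adjusted = {}
    for a, b in pairs:
        se = math.sqrt(base_var * (1 / size[a] + 1 / size[b]))
        z = (mean_rank[a] - mean_rank[b]) / se
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        adjusted[(a, b)] = min(1.0, p * len(pairs))
    return h, p_overall, adjusted


if __name__ == "__main__":
    # Hypothetical total scores for illustration only (NOT study data).
    scores = {
        "ChatGPT": [10, 12, 9, 11, 13, 8],
        "Gemini": [12, 14, 11, 15, 13, 10],
        "Copilot": [5, 7, 6, 4, 8, 6],
    }
    h, p, posthoc = kruskal_dunn(scores)
    print(f"H = {h:.3f}, p = {p:.4f}")
    for pair, p_adj in posthoc.items():
        print(pair, f"adjusted p = {p_adj:.4f}")
```

Under this assumed procedure, a pairwise difference would be read as significant when its adjusted p-value falls below the 0.05 threshold stated in the Methods. Other post hoc choices, such as pairwise Mann–Whitney U tests with Bonferroni correction, would be equally consistent with the abstract's wording.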
