DOI: 10.4103/jicdro.jicdro_106_25 ISSN: 2231-0754

Diagnostic Performance of Dental Professionals and a Vision-enabled Artificial Intelligence Model (ChatGPT-4o) in Radiographic Detection of Apical Lesions

Hanadi Sabban

Background:

This study evaluated the diagnostic performance of ChatGPT-4o, a vision-capable large language model, in identifying apical periodontitis on periapical radiographs. Its accuracy was compared to that of board-certified oral radiologists and endodontists. In addition, the study examined how clinical and radiographic co-factors influenced diagnostic outcomes.

Materials and Methods:

In this retrospective cross-sectional diagnostic accuracy study, 166 periapical radiographs were independently assessed by four reader groups: board-certified oral radiologists, endodontists, baseline ChatGPT-4o, and ChatGPT-4o applied to single-tooth cropped images. Outcomes were binary (lesion present/absent). Performance was summarized by overall accuracy, inter-rater agreement (Fleiss’ κ), and area under the receiver operating characteristic curve. Multivariable logistic regression was used with radiographic quality, root morphology, crestal bone loss, and tooth position as covariates. Images originated from a university dental hospital archive under institutional oversight.
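As an illustration of two of the metrics named above, the following minimal Python sketch computes overall accuracy and Fleiss’ κ for binary lesion calls. It is not the study's analysis code: the function names `accuracy` and `fleiss_kappa` and the toy ratings are hypothetical, and the sketch assumes every image is rated by the same number of readers.

```python
from collections import Counter


def accuracy(predictions, reference):
    """Fraction of binary calls (1 = lesion present, 0 = absent) matching the reference."""
    if len(predictions) != len(reference):
        raise ValueError("predictions and reference must have the same length")
    matches = sum(p == r for p, r in zip(predictions, reference))
    return matches / len(reference)


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-image rating lists.

    ratings[i] holds one label per reader for image i; every image must
    have the same number of readers (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each image needs the same number (>= 2) of ratings")

    categories = sorted({label for item in ratings for label in item})
    # counts[i][c] = number of readers who assigned image i to category c
    counts = [Counter(item) for item in ratings]

    # Observed agreement, averaged over images
    p_bar = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items

    # Agreement expected by chance from the overall category proportions
    p_e = sum(
        (sum(c[cat] for c in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )

    if p_e == 1.0:
        # Every rating falls in one category, so kappa is undefined
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 6 images, 3 readers each (NOT data from the study)
toy_ratings = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
toy_reference = [1, 0, 1, 0, 1, 0]
first_reader = [item[0] for item in toy_ratings]

print("accuracy of first reader:", accuracy(first_reader, toy_reference))
print("Fleiss' kappa:", fleiss_kappa(toy_ratings))
```

Fleiss’ κ corrects observed agreement for the agreement expected by chance, which is why it can stay low even when several readers reach similar raw accuracy.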

Results:

Oral radiologists and endodontists showed higher diagnostic accuracy (57.1% and 56.8%) than ChatGPT-4o using cropped images (41.7%) or standard input (36.3%) (P < 0.001), with moderate agreement between specialists (κ = 0.42) and negligible agreement between ChatGPT-4o and human evaluators. Diagnostic accuracy varied by lesion category and was reduced in the presence of radiographic errors (−25%), multi-rooted teeth (−15%), and crestal bone loss (−12%) (all P < 0.05).

Conclusion:

ChatGPT-4o performed below clinical experts in detecting apical lesions on periapical radiographs. Image cropping improved the model's results, but they remained below expert performance. The study indicates that both radiographic errors and anatomical complexity substantially limit diagnostic accuracy.
