Analysis of Accuracy and Response Stability of Artificial Intelligence Models in Answering Diagnostic, Therapeutic and Prognostic Questions on Dental Trauma
Songul Kilic, Ilke Doga Seker
ABSTRACT
This study compared the accuracy and response stability of artificial intelligence (AI) models in answering diagnostic, therapeutic and prognostic questions related to dental trauma (DT), based on the International Association of Dental Traumatology (IADT) guidelines. Fifty multiple-choice questions derived from the IADT guidelines were categorised into diagnostic (n = 13), therapeutic (n = 30) and prognostic (n = 7) domains and administered to eight AI models once weekly over three consecutive weeks. Responses were coded as correct (1) or incorrect (0) and analysed. Statistical significance was set at p < 0.05. Most models showed statistically significant response stability across sessions; the exception was ChatGPT-4.5 (κ = −0.007, p = 0.934). ChatGPT-5 showed moderate agreement (κ = 0.479, p < 0.05). Accuracy differed among models for therapeutic questions and for all questions combined (p < 0.05), but not for the diagnostic or prognostic domains (p > 0.05). Model type significantly affected accuracy (p = 0.001), whereas question category (p = 0.259) and time (p = 0.436) had no significant effect. The AI models showed heterogeneous performance. High accuracy did not necessarily correspond to response stability, as observed for ChatGPT-4.5, indicating that these systems should be used cautiously and only as supplementary tools, and that the findings apply within a structured multiple-choice framework.
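To illustrate how a model can be accurate yet unstable, the following is a minimal sketch of an agreement statistic on 0/1-coded responses. It assumes Cohen's kappa computed between two sessions; the abstract reports κ values but does not state which kappa variant or which session pairs were used, so this is not the authors' procedure. The function name `cohen_kappa` and all response vectors are hypothetical illustrations, not study data.

```python
import math


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of 0/1 codes.

    Assumption: agreement is measured between two sessions only.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of questions coded identically in both sessions.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement from each session's marginal share of 1s.
    pa1 = sum(a) / n
    pb1 = sum(b) / n
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if math.isclose(p_e, 1.0):
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: the same ten questions answered in two sessions.
week1 = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
week2 = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
kappa_moderate = cohen_kappa(week1, week2)

# Hypothetical data: 90% correct in both sessions, but the single error
# falls on a different question each time.
accurate_a = [1] * 9 + [0]
accurate_b = [0] + [1] * 9
kappa_unstable = cohen_kappa(accurate_a, accurate_b)

print(round(kappa_moderate, 3))  # approximately 0.524
print(round(kappa_unstable, 3))  # approximately -0.111
```

In the second pair, accuracy is 0.9 in both sessions, yet kappa is slightly negative: because correct answers dominate, chance agreement is already high (0.82), and the observed agreement (0.80) falls just below it. This is one way high accuracy and low response stability can coexist.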