Diagnostic Performance and Error Patterns of a Large Language Model and Neural Network in Periodontitis Classification: A Comparative Study
Agata Ossowska, Aida Kusiak, Albert Camlet, Dariusz Świetlik

Background/Objectives: Periodontitis is a highly prevalent chronic disease that requires accurate diagnosis for effective treatment planning. Artificial intelligence (AI) has emerged as a potential tool to support clinical decision-making. This study aimed to compare the diagnostic performance and classification error patterns of a large language model (LLM) and a neural network (NN) in classifying periodontitis according to the current staging and grading system.

Methods: This retrospective study included 110 patients with periodontal disease. The clinical and demographic variables analyzed were age, sex, smoking status, number of teeth, approximal plaque index (API), bleeding on probing (BOP), probing pocket depth (PPD), and clinical attachment level (CAL). Reference diagnoses were established by two experts. Each case was then classified by the LLM and by the neural network. Model performance was assessed using accuracy, confusion matrices, and Cohen’s kappa coefficient, supplemented by an analysis of classification errors.

Results: The LLM achieved 62% accuracy for stage and 63% for grade classification (κ = 0.48). The neural network performed better, with 85% accuracy for stage and 79% for grade (κ = 0.79 and κ = 0.67, respectively). The LLM more often underestimated disease severity, whereas the neural network tended to overestimate progression. Differences between the models were statistically significant (p < 0.0001).

Conclusions: In this dataset and classification task, the task-specific neural network demonstrated higher diagnostic performance than the evaluated large language model. However, the findings should be interpreted in light of the fundamentally different training paradigms and intended applications of these AI systems. Further research is required to optimize and validate AI-based approaches for clinical use.
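
The evaluation metrics named in the Methods (accuracy, confusion matrix, Cohen's kappa) and the under/overestimation error analysis can be illustrated with a minimal sketch. It is not the authors' code: the ten case labels below are invented for illustration and are not study data, the function names are hypothetical, and unweighted kappa, κ = (p_o − p_e) / (1 − p_e), is assumed because the abstract does not state whether a weighted variant was used.

```python
# Illustrative sketch only: toy labels, NOT data from the study.
# Stages are treated as ordered categories I < II < III < IV.

def confusion_matrix(reference, predicted, labels):
    """Rows are reference (expert) labels, columns are model predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for ref, pred in zip(reference, predicted):
        matrix[index[ref]][index[pred]] += 1
    return matrix

def accuracy(matrix):
    """Share of cases on the diagonal (model agrees with reference)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def cohen_kappa(matrix):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    total = sum(sum(row) for row in matrix)
    p_o = accuracy(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance from the marginal distributions.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_o - p_e) / (1 - p_e)

def under_over(matrix):
    """Count predictions below (under) and above (over) the reference class."""
    n = len(matrix)
    under = sum(matrix[i][j] for i in range(n) for j in range(n) if j < i)
    over = sum(matrix[i][j] for i in range(n) for j in range(n) if j > i)
    return under, over

labels = ["I", "II", "III", "IV"]
# Hypothetical expert reference and model output for ten cases.
reference = ["I", "II", "II", "III", "III", "III", "IV", "IV", "III", "II"]
predicted = ["I", "II", "I", "III", "II", "III", "IV", "III", "III", "II"]

matrix = confusion_matrix(reference, predicted, labels)
acc = accuracy(matrix)
kappa = cohen_kappa(matrix)
under, over = under_over(matrix)

print(f"accuracy={acc:.2f} kappa={kappa:.2f} under={under} over={over}")
```

In this toy example every error falls below the diagonal, which is the pattern the abstract describes as underestimating severity; errors above the diagonal would correspond to overestimation.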