Evaluation of the Validity and Reliability of AI Chatbots in Response to Frequently Asked Questions About Teeth Whitening

doi:10.1111/jerd.70223

ISSN: 1496-4155

Ahmet Cankut Karamehmet, Fikret Yilmaz

ABSTRACT

Objective

This study aimed to evaluate the validity and reliability of responses generated by GPT‐4o, Microsoft Copilot, Google Gemini, and DeepSeek to 20 frequently asked patient questions about tooth whitening.

Materials and Methods

Twenty common questions about tooth whitening were selected based on clinical experience and AI‐generated suggestions. Each question was submitted three times to each chatbot through its official web interface. The responses were evaluated by two professors and four specialists in restorative dentistry using a five‐point Likert scale based on a modified Global Quality Score. Validity was analyzed under both low‐threshold and high‐threshold criteria. Internal consistency was assessed using Cronbach's alpha coefficient, and inter‐rater reliability was calculated using the intraclass correlation coefficient.
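For readers unfamiliar with the internal consistency statistic named above, the following is a minimal sketch of the standard Cronbach's alpha formula. The abstract does not state how the score matrix was arranged, so the layout here is an assumption: each row is one question and each column is one of the three repeated submissions. The function name `cronbach_alpha` and the sample scores are hypothetical and are not taken from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of scores.

    scores: list of rows; each row is one subject (assumed here: one question)
    and each column is one item (assumed here: one repeated submission).
    """
    k = len(scores[0])  # number of items (columns)
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two subjects")

    # Sample variance of each item (column) across subjects.
    item_variances = [variance(column) for column in zip(*scores)]

    # Sample variance of the per-subject total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical Likert scores (1-5): four questions, three repeated submissions.
example_scores = [
    [5, 4, 5],
    [3, 3, 4],
    [4, 4, 4],
    [2, 3, 2],
]
alpha = cronbach_alpha(example_scores)
```

Values closer to 1 indicate that the repeated responses were scored more consistently. The intraclass correlation coefficient used for inter‐rater reliability is a separate statistic and is not covered by this sketch.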

Results

In the low‐threshold validity analysis, GPT‐4o and DeepSeek achieved the highest validity rate, providing valid responses to all 20 questions, whereas Microsoft Copilot and Google Gemini showed lower rates. In the high‐threshold validity analysis, GPT‐4o and DeepSeek again showed the highest valid response rates, whereas Google Gemini and Microsoft Copilot showed lower rates. No significant difference was found among the chatbots in either the low‐threshold or the high‐threshold validity rates. In the reliability analysis, the highest internal consistency was observed for DeepSeek, followed by Microsoft Copilot, Google Gemini, and GPT‐4o.

Conclusions

The evaluated chatbots differed in the validity and reliability of their responses to frequently asked patient questions about tooth whitening. GPT‐4o and DeepSeek yielded the highest rates in both the low‐threshold and high‐threshold validity analyses, and DeepSeek showed the highest internal consistency.

Clinical Significance

This study indicated that the evaluated AI chatbots generated generally valid but variable responses to frequently asked patient questions about tooth whitening. The findings support the professionally supervised use of chatbot‐generated information as supplementary patient education material in dentistry.
