DOI: 10.22531/muglajsci.1862743 ISSN: 2149-3596

Comparative Evaluation of ChatGPT, Gemini and Grok with and without Deep Research Mode in Answering Bone Augmentation Queries

Ali Batuhan Bayırlı, Mehmetcan Uytun, Ruşen Erdem, Yavuz Selim Genç
This study aimed to evaluate the information generation performance of four large language models (ChatGPT-o1-pro, ChatGPT-4o, Gemini 2.5 Flash, and Grok-3) using multidimensional criteria, based on their responses to expert-level questions in the field of intraoral bone augmentation. It also investigated the effect of enabling Deep Search mode on model performance. A total of 20 questions, generated by experts with input from domain specialists, were submitted to each model under two retrieval configurations: with and without Deep Search mode enabled. Two periodontology experts independently assessed the responses for accuracy, content quality, readability, and reading level. ChatGPT-o1-pro with Deep Search achieved the highest and most balanced performance across all evaluation criteria. ChatGPT-4o with Deep Search also performed strongly, particularly in accuracy and content quality, although its readability was comparatively lower. In contrast, Gemini 2.5 Flash and Grok-3 performed relatively weakly across all dimensions. Overall, activating Deep Search mode produced a statistically significant improvement in performance. These findings underscore the growing potential of large language models in clinically complex and information-intensive domains such as dentistry. The results also suggest that, beyond model selection, the operational mode, particularly the use of advanced retrieval configurations, plays a crucial role in determining the quality of generated information.
