Clinical Safety and Reliability of Large Language Models in Answering Hemorrhoid-Related Patient Questions: A Comparative Study of ChatGPT, Gemini, and DeepSeek


DOI: 10.3390/healthcare14131909 ISSN: 2227-9032


Ethem Bilgiç, Erkan Karacan

Background: Large language models (LLMs) are increasingly used by patients to obtain medical information; however, concerns remain regarding their clinical safety, reliability, and the appropriateness of their patient guidance. Evidence on LLM performance in answering hemorrhoid-related patient questions remains limited.

Objective: To compare the clinical accuracy, safety, and overall clinical adequacy of responses generated by ChatGPT, Gemini, and DeepSeek to hemorrhoid-related patient questions.

Methods: In this cross-sectional comparative study, 25 hemorrhoid-related patient questions were developed and categorized into three predefined subgroups: basic informational questions, clinically significant scenarios, and misleading/risky patient statements. Responses generated by ChatGPT (GPT-5.3), Gemini (3.1), and DeepSeek (R1) were evaluated by two experienced surgeons using a consensus-based expert assessment approach and a structured 5-point scoring system assessing clinical accuracy, safety, appropriateness of patient guidance, and overall clinical adequacy. Critical errors and qualitative communication characteristics were also analyzed. Friedman and post hoc Conover tests with Bonferroni correction were used for statistical comparisons.

Results: Overall response quality differed significantly among models (χ²(2) = 29.119, p < 0.001, Kendall’s W = 0.582). ChatGPT achieved the highest overall scores (5.00 ± 0.00), followed by Gemini (4.80 ± 0.41) and DeepSeek (4.12 ± 0.67). Significant differences were primarily observed between DeepSeek and the other models, whereas ChatGPT and Gemini showed comparable performance. Model divergence became more pronounced in clinically significant scenarios involving alarm symptoms, rectal bleeding, persistent symptoms, and acute anorectal pain. No model generated directly harmful medical recommendations or explicit guidance likely to result in substantial diagnostic delay. However, qualitative assessment demonstrated differences in communication style and risk communication. ChatGPT generally produced more balanced and context-appropriate responses, Gemini generated more explanatory responses, and DeepSeek showed a tendency toward disproportionately urgent or alarmist language in some higher-risk scenarios.

Conclusions: Large language models demonstrated generally high clinical accuracy in answering hemorrhoid-related patient questions; however, notable model-specific differences were observed in clinical guidance, communication style, and risk communication, particularly in high-risk clinical scenarios. These findings suggest that LLMs may serve as useful supportive tools for patient education and health information delivery, although they should currently be regarded as systems that support rather than replace human clinical judgment.
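
As a reading aid, the effect size reported with the Friedman test can be checked against the test statistic. The sketch below uses the standard relation W = χ² / (N(k − 1)) and assumes that the 25 questions served as the blocks (N = 25) and the three models as the compared conditions (k = 3); the abstract does not state this explicitly, although the reported 2 degrees of freedom are consistent with k = 3. The function name `kendalls_w` is illustrative and not taken from the study.

```python
def kendalls_w(chi_square: float, n_blocks: int, n_conditions: int) -> float:
    """Kendall's W derived from a Friedman chi-square statistic.

    Uses W = chi_square / (N * (k - 1)), where N is the number of blocks
    (assumed here: the 25 questions) and k the number of conditions
    (assumed here: the 3 models).
    """
    if n_blocks <= 0 or n_conditions < 2:
        raise ValueError("need at least 1 block and 2 conditions")
    return chi_square / (n_blocks * (n_conditions - 1))


# Values as reported in the abstract; N and k are assumptions (see above).
w = kendalls_w(chi_square=29.119, n_blocks=25, n_conditions=3)
print(round(w, 3))  # 0.582
```

Under these assumptions the result matches the reported Kendall’s W of 0.582, which is conventionally read as a large effect.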
