Reliability and readability of adenoid hypertrophy information generated by five publicly accessible LLM chatbots: A default-setting snapshot study
Xiaoming Qian, Zhishui Wu, Jing Li, Qiuyu Su, Qian Qin, Beibei Zhang
Objective
This study evaluated the performance of five major large language model (LLM) chatbots in generating patient-oriented information on adenoid hypertrophy, focusing on content reliability and readability.
Methods
Sixty-three frequently asked questions (FAQs) on adenoid hypertrophy were collected, covering seven domains including etiology, symptoms, and treatment. From October 1, 2025, to January 10, 2026, the questions were submitted in English to five LLMs via their official web interfaces. Reliability was assessed using DISCERN, the Ensuring Quality Information for Patients (EQIP) tool, the JAMA benchmarks, and the Global Quality Scale (GQS). Readability was measured with six standard indices: the Automated Readability Index (ARI), Coleman-Liau Index (CLI), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), and Flesch Reading Ease Score (FRES). Three otorhinolaryngology clinicians scored all responses in a blinded manner.
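To illustrate how two of the readability indices are computed, the following is a minimal sketch of the standard FRES and FKGL formulas. The abstract does not state which tool produced the scores, so this is not the authors' pipeline. The function names are hypothetical, and the syllable counter is a crude vowel-group heuristic that is an assumption of this sketch; validated readability software counts syllables more carefully.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (assumed heuristic, not exact)."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as "table".
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def text_stats(text: str) -> tuple[int, int, int]:
    """Return (sentence count, word count, syllable count) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text: str) -> float:
    """FRES: higher scores indicate easier text."""
    s, w, syl = text_stats(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)


def flesch_kincaid_grade(text: str) -> float:
    """FKGL: approximate US school grade needed to read the text."""
    s, w, syl = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59


if __name__ == "__main__":
    easy = "The cat sat on the mat."
    hard = ("Adenoid hypertrophy frequently necessitates "
            "otorhinolaryngological evaluation.")
    for label, sample in (("easy", easy), ("hard", hard)):
        print(label,
              round(flesch_reading_ease(sample), 2),
              round(flesch_kincaid_grade(sample), 2))
```

Both indices depend only on average sentence length and average syllables per word, so they score surface complexity rather than whether a reader actually understands the content.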
Results
Significant differences in reliability were found among models (
Conclusion
Current LLMs show a notable imbalance between reliability and readability in generating adenoid hypertrophy information, with none excelling in both. In this default-setting, product-level snapshot, Perplexity showed higher information-quality scores, whereas Gemini generated comparatively easier-to-read responses. These findings should not be interpreted as a controlled benchmark of underlying base models. Limitations include potential prompt sensitivity, single-response sampling, and the snapshot nature of the assessment given rapid model updates. Future improvements should focus on source transparency, text simplification, and condition-specific evaluation to enhance AI-assisted health communication for pediatric care.