Reliability and readability of adenoid hypertrophy information generated by five publicly accessible LLM chatbots: A default-setting snapshot study

DOI: 10.1177/20552076261459596 ISSN: 2055-2076

Xiaoming Qian, Zhishui Wu, Jing Li, Qiuyu Su, Qian Qin, Beibei Zhang

Objective

This study evaluated the performance of five major large language model (LLM) chatbots in generating patient-oriented information on adenoid hypertrophy, focusing on content reliability and readability.

Methods

Sixty-three frequently asked questions (FAQs) on adenoid hypertrophy were collected, covering seven domains including etiology, symptoms, and treatment. From October 1, 2025, to January 10, 2026, the questions were submitted in English to five LLM chatbots via their official web interfaces. Reliability was assessed using DISCERN, EQIP, the JAMA benchmarks, and the Global Quality Scale (GQS). Readability was measured with six standard indices: Automated Readability Index (ARI), Coleman-Liau Index (CLI), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Simple Measure of Gobbledygook (SMOG), and Flesch Reading Ease Score (FRES). Three otorhinolaryngology clinicians scored all responses under blinded conditions.
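To make the readability measures concrete, the following is a minimal Python sketch of two of the six indices, FRES and FKGL, using their standard published formulas. It is an illustration only: the abstract does not state which tool the authors used, and the function names, the regex-based sentence and word splitting, the vowel-group syllable heuristic, and the grade cutoff of 6.0 are all assumptions made here. Dedicated readability calculators count syllables more carefully, so scores from this sketch will not exactly match theirs.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, minus a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (words, sentences, syllables) using simple regex splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """FRES: higher scores indicate easier text."""
    if words == 0 or sentences == 0:
        raise ValueError("text must contain at least one word and one sentence")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """FKGL: approximate US school grade needed to read the text."""
    if words == 0 or sentences == 0:
        raise ValueError("text must contain at least one word and one sentence")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def meets_sixth_grade(text: str, max_grade: float = 6.0) -> bool:
    """Check FKGL against a sixth-grade cutoff (the exact cutoff is an assumption)."""
    return flesch_kincaid_grade(*text_counts(text)) <= max_grade


if __name__ == "__main__":
    sample = "Adenoids are small pads of tissue. They sit behind the nose."
    w, s, syl = text_counts(sample)
    print(f"FRES: {flesch_reading_ease(w, s, syl):.2f}")
    print(f"FKGL: {flesch_kincaid_grade(w, s, syl):.2f}")
```

Both formulas depend only on average sentence length and average syllables per word, so shorter sentences and simpler vocabulary are the two levers that raise FRES and lower FKGL.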

Results

Reliability differed significantly among models (P<0.001). Perplexity scored highest on DISCERN (41.98±1.87) and EQIP (58.40±3.67), followed by Copilot; ChatGPT and DeepSeek scored lowest. Only Copilot and Perplexity scored 1 point on the JAMA benchmarks. Gemini had the best readability (FRES: 61.95±9.64), while Copilot had the poorest (FRES: 24.27±10.77). No model met the recommended sixth-grade reading level (P<0.001).

Conclusion

Current LLMs show a notable imbalance between reliability and readability in generating adenoid hypertrophy information, with none excelling in both. In this default-setting, product-level snapshot, Perplexity showed higher information-quality scores, whereas Gemini generated comparatively easier-to-read responses. These findings should not be interpreted as a controlled benchmark of underlying base models. Limitations include potential prompt sensitivity, single-response sampling, and the snapshot nature of the assessment given rapid model updates. Future improvements should focus on source transparency, text simplification, and condition-specific evaluation to enhance AI-assisted health communication for pediatric care.
