Large language model chatbots as sources of pediatric anesthesia health advice: An evaluation of reliability and readability

DOI: 10.1177/20552076261464749 ISSN: 2055-2076

Xue Zhang, Yuchen Dai, Xin Zhao, Lin Wu, Boming Shao, Xisheng Shan, Fuhai Ji, Runzhi Deng, Baojian Zhao

Background

Large language models are increasingly used to obtain health information, but the quality of the information they provide on pediatric anesthesia remains insufficiently evaluated. This study aimed to assess the reliability and readability of responses from four widely used AI chatbots in this context.

Methods

In this cross-sectional observational study, 18 pediatric anesthesia-related questions were developed using Medical Subject Headings terms, online search trend analysis, and commonly queried topics reflecting parental information needs. Each question was submitted under standardized conditions to four generative AI-driven chatbots: OpenAI’s GPT-5.1 Thinking, Google’s Gemini 3 Pro, Anthropic’s Claude Opus 4.5 Extended Thinking, and DeepSeek-V3.2-Speciale. Models were accessed in their vendor-deployed configurations without task-specific fine-tuning. The generated responses were evaluated for information reliability using the Ensuring Quality Information for Patients (EQIP) instrument, the DISCERN tool, the Global Quality Score (GQS), and the Journal of the American Medical Association (JAMA) benchmark criteria. Readability was assessed using seven validated indices: Flesch Reading Ease Score, Flesch–Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook, Coleman–Liau Index, Automated Readability Index, and Linsear Write Formula.
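To make the readability indices concrete, the following is a minimal sketch of two of them, the Flesch Reading Ease Score and the Flesch–Kincaid Grade Level, using their standard published formulas. It is not the authors' tooling, which the abstract does not describe. The function names `count_syllables` and `readability` are hypothetical, and the syllable counter is a rough vowel-group heuristic, so scores will differ somewhat from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (heuristic, an assumption)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent (e.g. "made"), except in "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Return Flesch Reading Ease and Flesch-Kincaid Grade Level for text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        # Higher score = easier text.
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Approximate US school grade needed to read the text.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }
```

Both formulas depend only on average sentence length and average syllables per word, so a response built from long sentences and polysyllabic clinical terms scores as harder to read regardless of how accurate it is.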

Results

A total of 72 chatbot-generated responses (18 questions submitted to each of four models) were included for analysis. Significant between-model differences were observed in DISCERN, EQIP, and GQS scores, while JAMA benchmark scores were consistently low across all models. DeepSeek and Gemini showed higher median reliability scores across several instruments, although significant pairwise differences mainly involved ChatGPT. None of the evaluated models achieved the recommended sixth-grade readability level on any index. Correlations between reliability and readability were non-significant, suggesting that these represent independent dimensions of information quality.
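The abstract does not state which correlation coefficient was used. As an illustration only, the sketch below assumes a rank-based (Spearman) coefficient, a common choice for ordinal instrument scores, and computes it in plain Python. The function names `average_ranks` and `spearman_rho` are hypothetical, and no study data are reproduced here.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5
```

Under this assumption, `x` would hold one reliability score per response (for example DISCERN) and `y` the matching readability score. A coefficient near zero that fails a significance test is consistent with the two measures varying independently, although a non-significant result in a sample of 72 does not by itself establish independence.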

Conclusions

Current LLM-based chatbots provided pediatric anesthesia information with variable reliability and consistently suboptimal readability. Although certain models demonstrated relatively higher information quality, limited transparency and excessive reading complexity may restrict their suitability for public-facing educational use. These findings highlight the need for improved quality control, enhanced transparency, and readability-focused optimization in pediatric perioperative education.
