A Cross-Sectional Study of Large Language Models in Lung Cancer Information Delivery: Readability, Quality, and Patient-Centred Evaluation
Ömer Önal, Suzan Temiz Bekce

Background/Objectives: Lung cancer is a leading cause of cancer-related mortality worldwide. As patients increasingly utilize large language models (LLMs) for health information, evaluating the readability and patient-centeredness of these tools is critical. This study aims to compare the performance of ChatGPT-4o mini, Microsoft Copilot, and Google Gemini in providing lung cancer information, focusing on their utility for individuals with limited health literacy.

Methods: In this cross-sectional study (March 2026), 30 responses to ten standardized lung cancer-related queries were analyzed. Outputs were assessed using JAMA benchmarks and mDISCERN for quality, the SMOG index for readability, and PEMAT-P for understandability and actionability. Inter-rater reliability was analyzed using intraclass correlation coefficients (ICCs).

Results: ChatGPT-4o mini demonstrated superior readability, achieving a sixth-grade level (SMOG: 6.23 ± 0.72, p < 0.001). Gemini achieved higher JAMA scores, indicating stronger academic rigour. While PEMAT-P scores were highest for ChatGPT (63.7%), all models exhibited moderate mDISCERN quality. Inter-rater reliability was excellent for JAMA (ICC = 1.000) and PEMAT-P (ICC = 0.883), though moderate for mDISCERN (ICC = 0.365), reflecting inherent interpretative subjectivity in qualitative assessment. No hallucinations were observed.

Conclusions: Current LLMs exhibit a trade-off between accessibility and academic rigour: ChatGPT favours patient-friendly readability, while Gemini emphasizes structured content. The observed inter-rater variability in mDISCERN underscores the complexity of standardizing qualitative AI evaluation. These findings suggest that LLMs function best as complementary aids rather than substitutes for physician-led communication.
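For readers unfamiliar with the SMOG index used in the Methods, the following is a minimal sketch of how a SMOG grade can be computed from a block of text. It uses the standard SMOG formula, grade = 1.0430 × sqrt(polysyllables × 30 / sentences) + 3.1291, where polysyllables are words of three or more syllables. The abstract does not state which tool the authors used, so the function names `count_syllables` and `smog_grade` are hypothetical, and the syllable counter is a rough vowel-group heuristic rather than a dictionary-based count. Scores from this sketch may therefore differ slightly from those of an established readability calculator.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Approximate syllables by counting vowel groups (heuristic, not exact)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Discount a trailing silent 'e' (e.g. "alternative"), but not "-le" or "-ee".
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def smog_grade(text: str) -> float:
    """Return the SMOG grade, normalizing the polysyllable count to 30 sentences."""
    # Naive sentence split on terminal punctuation; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text contains no sentences")
    words = re.findall(r"[A-Za-z]+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


if __name__ == "__main__":
    simple = "The cat sat. The dog ran."
    dense = "Chemotherapy is complicated. Radiotherapy is alternative."
    print(round(smog_grade(simple), 2), round(smog_grade(dense), 2))
```

SMOG was designed for samples of about 30 sentences, so grades for very short responses should be read with caution; the normalization term only rescales the count and does not remove that limitation.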