Comparative Evaluation of Claude, Gemini, and ChatGPT in Responding to Patient Inquiries About Hyperprolactinemia and Prolactinoma: A Multi-Rater Cross-Sectional Study Using English-Language Queries

DOI: 10.59518/farabimedj.1928053 ISSN: 2979-9821

İsmail Engin, Melda Çelik Tufan, Esma Merve Arda Özkan

Objective: This cross-sectional study compared the accuracy and adequacy of responses generated by three state-of-the-art large language models (LLMs), Claude Opus 4.6, Gemini 3 Pro, and ChatGPT GPT-5.4, to 50 frequently asked English-language questions about hyperprolactinemia and prolactinoma, as evaluated by three independent endocrinologists.

Methods: Questions compiled from outpatient practice were presented using a zero-shot, plain-query approach through the publicly available consumer web interfaces. Responses were independently evaluated by three blinded endocrinologists using a 6-point Likert scale for accuracy and a 5-point Likert scale for adequacy. For each question, the three evaluators' ratings were averaged into a question-level score, producing 50 question-level scores per model. Inter-model comparisons used the Friedman test with Bonferroni-corrected Wilcoxon signed-rank post hoc tests (adjusted α = .017); within-model comparisons across thematic categories used the Kruskal–Wallis test.

Results: Claude achieved the highest mean accuracy (5.76 ± 0.64; 95% CI 5.58–5.93) and adequacy (4.52 ± 0.52; 95% CI 4.38–4.67), followed by Gemini (accuracy 5.58 ± 0.73; adequacy 4.31 ± 0.63) and ChatGPT (accuracy 5.39 ± 0.90; adequacy 4.06 ± 0.82). The Friedman test was significant for both metrics (accuracy χ² = 50.54; adequacy χ² = 63.51; both p < .001). All three pairwise comparisons were significant after Bonferroni correction (all p < .001), following the order Claude > Gemini > ChatGPT, with large effect sizes (r = .63–.72). The pregnancy period category yielded the lowest scores across all models.

Conclusion: These findings demonstrate the capacity of LLMs to generate clinically accurate, guideline-concordant English-language content regarding prolactinoma under optimal linguistic conditions. However, because the evaluation was limited to English-language queries, the results cannot be generalized to non-English-speaking populations, and the use of LLMs as patient education tools requires direct comprehension and readability studies in the target language.
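
The scoring and omnibus-test steps named in the Methods can be illustrated with a minimal sketch using only the Python standard library. The abstract does not state which software the authors used, so this is not their implementation. All function names and the toy ratings below are hypothetical and are not study data. The sketch averages three raters' scores per question, computes a tie-corrected Friedman chi-square across models, and derives the Bonferroni-adjusted significance threshold for three pairwise comparisons.

```python
from collections import Counter
from statistics import mean


def question_level_scores(ratings_per_question):
    """Average the raters' scores for each question (one list per question)."""
    return [mean(r) for r in ratings_per_question]


def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_chi_square(scores_by_model):
    """Tie-corrected Friedman statistic for k models scored on the same n questions.

    Assumes at least one question has non-identical scores across models;
    otherwise the tie correction is zero and the statistic is undefined.
    """
    columns = list(scores_by_model.values())
    k, n = len(columns), len(columns[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in zip(*columns):  # one row = one question across all models
        for j, r in enumerate(average_ranks(list(row))):
            rank_sums[j] += r
        tie_term += sum(t**3 - t for t in Counter(row).values())
    stat = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    return stat / correction


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical toy ratings: 4 questions x 3 raters per model (NOT study data).
toy = {
    "model_a": question_level_scores([[6, 6, 6], [6, 5, 6], [6, 6, 6], [6, 6, 5]]),
    "model_b": question_level_scores([[5, 5, 5], [5, 5, 5], [5, 6, 5], [5, 5, 5]]),
    "model_c": question_level_scores([[4, 4, 4], [4, 5, 4], [4, 4, 4], [5, 4, 4]]),
}

print("Friedman chi-square:", friedman_chi_square(toy))
print("Adjusted alpha:", round(bonferroni_alpha(0.05, 3), 3))
```

With three models there are three pairwise comparisons, so a nominal α of .05 divided by 3 gives the adjusted α = .017 reported above. In practice, a statistics package would normally be used for the Friedman, Wilcoxon signed-rank, and Kruskal–Wallis tests and their p-values; the hand-written statistic here only makes the ranking step explicit.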
