DOI: 10.59518/farabimedj.1928053 ISSN: 2979-9821

Comparative Evaluation of Claude, Gemini, and ChatGPT in Responding to Patient Inquiries About Hyperprolactinemia and Prolactinoma: A Multi-Rater Cross-Sectional Study Using English-Language Queries

İsmail Engin, Melda Çelik Tufan, Esma Merve Arda Özkan
Objective: This cross-sectional comparative study aimed to compare the accuracy and adequacy of responses generated by three state-of-the-art large language models (LLMs), Claude Opus 4.6, Gemini 3 Pro, and ChatGPT GPT-5.4, to 50 frequently asked English-language questions about hyperprolactinemia and prolactinoma, as evaluated by three independent endocrinologists.

Methods: Questions compiled from outpatient practice were presented using a zero-shot, plain-query approach through the publicly available consumer web interfaces. Responses were independently evaluated by three blinded endocrinologists using a 6-point Likert scale for accuracy and a 5-point Likert scale for adequacy. For each question, the three evaluators’ ratings were averaged to a question-level score, producing 50 question-level scores per model. Inter-model comparisons used the Friedman test with Bonferroni-corrected Wilcoxon signed-rank post hoc tests (adjusted α = .017); within-model comparisons across thematic categories used the Kruskal–Wallis test.

Results: Claude achieved the highest mean accuracy (5.76 ± 0.64; 95% CI 5.58–5.93) and adequacy (4.52 ± 0.52; 95% CI 4.38–4.67), followed by Gemini (accuracy 5.58 ± 0.73; adequacy 4.31 ± 0.63) and ChatGPT (accuracy 5.39 ± 0.90; adequacy 4.06 ± 0.82). The Friedman test was significant for both metrics (accuracy χ² = 50.54; adequacy χ² = 63.51; both p < .001). All three pairwise comparisons were significant after Bonferroni correction (all p < .001), following the order Claude > Gemini > ChatGPT, with large effect sizes (r = .63–.72). The pregnancy period category yielded the lowest scores across all models.

Conclusion: These findings demonstrate the capacity of LLMs to generate clinically accurate, guideline-concordant English-language content regarding prolactinoma under optimal linguistic conditions. However, because the evaluation was limited to English-language queries, the results cannot be generalized to non-English-speaking populations, and their use as patient education tools requires direct comprehension and readability studies in the target language.
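
The scoring and comparison steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' analysis code, and the abstract does not name the software used. The function names are hypothetical, the ratings are synthetic, and the Friedman statistic is computed without a tie correction, which a full analysis would probably apply.

```python
from statistics import mean


def question_level_scores(ratings):
    """Average the raters' scores for each question.

    ratings: list with one tuple per question, one score per rater.
    """
    return [mean(r) for r in ratings]


def average_ranks(values):
    """Rank values ascending (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(*samples):
    """Friedman chi-square for k related samples of equal length n.

    Assumption: no tie correction is applied in this sketch.
    """
    k = len(samples)
    n = len(samples[0])
    rank_sums = [0.0] * k
    # Rank the k models within each question, then sum ranks per model.
    for row in zip(*samples):
        for col, r in enumerate(average_ranks(list(row))):
            rank_sums[col] += r
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# Synthetic ratings (NOT study data): 4 questions x 3 raters per model.
model_a = question_level_scores([(6, 6, 6), (6, 5, 6), (5, 6, 6), (6, 6, 5)])
model_b = question_level_scores([(5, 5, 6), (5, 5, 5), (5, 5, 5), (5, 4, 5)])
model_c = question_level_scores([(4, 5, 4), (4, 4, 4), (4, 4, 5), (3, 4, 4)])

print("Friedman chi-square:", friedman_statistic(model_a, model_b, model_c))
# Three pairwise comparisons among three models: 0.05 / 3, about 0.017.
print("Adjusted alpha:", round(bonferroni_alpha(0.05, 3), 3))
```

With three models there are three pairwise comparisons, which is where the adjusted α of .017 comes from. The statistic is referred to a chi-square distribution with k − 1 degrees of freedom to obtain a p-value. In practice, library routines such as `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon` would typically replace the hand-rolled functions here.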
