Evaluation of ChatGPT, Gemini, and OpenEvidence in Obstetric and Gynecologic Clinical Decision Scenarios

DOI: 10.1055/a-2899-0123 ISSN: 1869-0327

Arif Onur Atay, Feride Atay, Samican Ozmen, Mucahit Furkan Balci

Abstract

Background: Clinicians frequently face questions that require rapid, evidence-based answers. Artificial intelligence (AI) tools are increasingly used for this purpose, yet their reliability for clinical decision-making remains uncertain. This study compared two generative large language model (LLM) systems (ChatGPT and Gemini) and a retrieval-supported clinical platform (OpenEvidence) to determine which provides the most reliable, clear, and clinically applicable information in obstetrics, gynecology, and urogynecology.

Methods: A cross-sectional comparative design was used to evaluate ChatGPT (GPT-5), Gemini (Gemini 2.5), and the retrieval-supported platform OpenEvidence. Twenty-four clinical questions across the three subspecialties were independently assessed by two blinded specialists using the Expert-Adapted DISCERN (EA-DISCERN) tool, which rates 12 quality domains on a five-point scale. Mean ± SD scores were compared across systems and clinical domains using repeated-measures analysis.

Results: OpenEvidence achieved the highest mean total score (54.0 ± 2.3), outperforming Gemini (50.3 ± 2.4) and ChatGPT (48.7 ± 2.4) (p < 0.001). OpenEvidence scored significantly higher in the evidence-based domains (clinical accuracy, guideline consistency, completeness, transparency, and reliability) across all three clinical fields. Gemini ranked between OpenEvidence and ChatGPT, showing a modest advantage over ChatGPT in rationale explanation and evidence transparency, while both generative models scored higher in language fluency and readability. Overall, total EA-DISCERN scores ranked OpenEvidence highest, followed by Gemini, then ChatGPT. Inter-rater reliability for the total score, measured as ICC(2,1) with absolute agreement, was 0.391.

Conclusions: OpenEvidence provided more guideline-aligned and transparent responses, whereas ChatGPT and Gemini were generally more fluent and readable. For OB/GYN clinicians, retrieval-supported platforms may be more suitable for point-of-care verification, while generative models should be used more cautiously and with clinician oversight.
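For readers unfamiliar with the reliability statistic reported above, the following is a minimal sketch of how a two-way random-effects, absolute-agreement, single-rater ICC(2,1) can be computed from a table of total scores, using the standard Shrout and Fleiss mean-squares formula. It is not the authors' analysis code. The function name `icc_2_1` and the demo ratings are hypothetical and chosen only for illustration.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated item (e.g. one AI response),
             each row holding one total score per rater.
    """
    n = len(ratings)      # number of rated items
    k = len(ratings[0])   # number of raters

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for items (rows), raters (columns), and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    # Mean squares from the two-way ANOVA decomposition.
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical totals (NOT study data): rater B scores each item one point
# higher than rater A, so the raters agree on ranking but not on level.
demo = [[48, 49], [50, 51], [54, 55]]
print(round(icc_2_1(demo), 3))
```

With these made-up scores the function returns about 0.949 rather than 1, because the absolute-agreement form penalizes the constant one-point offset between raters even though both rank the items identically.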
