DOI: 10.1002/wjo2.70131 ISSN: 2095-8811

Blinded by the Bot: Benchmarking GPT and Gemini Against Human Authors in Otolaryngology Reviews

Sholem Hack, Rebecca Attal, Letizia Nitro, Cecilia Rosso, Anastasia Urbanelli, Antonio Mario Bulfamante, Luigi Angelo Vaira, Omar G. Ahmed, Masayoshi Takashima

ABSTRACT

Objective

To compare the quality of scientific review articles generated by two artificial intelligence systems, ChatGPT and Gemini, with those written by human authors in the field of otolaryngology.

Methods

Two otolaryngology topics, chronic rhinosinusitis and infantile subglottic hemangioma, were selected. For each topic, four AI‐generated reviews (GPT‐4.0 and Gemini 2.0; narrative and PRISMA‐style) and one human‐authored peer‐reviewed review were included, yielding a total of 10 manuscripts (8 AI‐generated, 2 human‐authored). A blinded panel of seven board‐certified otolaryngologists evaluated all manuscripts using a 5‐point Likert scale across seven domains: scientific accuracy, depth of content, citation quality, structure and organization, readability and tone, critical insight, and overall scientific quality. Group comparisons were performed using linear mixed‐effects models with random intercepts for reviewer and manuscript. Interrater reliability was assessed using Shrout–Fleiss intraclass correlation coefficients (ICC). Manual verification of AI‐generated references was conducted to assess citation accuracy and fabrication.
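As an illustration of the interrater reliability step, the following is a minimal sketch of the Shrout–Fleiss two-way random-effects coefficients, ICC(2,1) for a single rater and ICC(2,k) for the mean of k raters, computed from a targets-by-raters table of scores. The function name `icc2` and the toy ratings are hypothetical and are not taken from the study; the abstract does not state which software the authors used.

```python
def icc2(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a targets x raters table of scores.

    Two-way random-effects, absolute-agreement form (Shrout & Fleiss).
    `ratings` is a list of rows, one row per rated target, one column per rater.
    """
    n = len(ratings)        # number of rated targets (rows)
    k = len(ratings[0])     # number of raters (columns)

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    bms = ss_rows / (n - 1)                 # between-targets mean square
    jms = ss_cols / (k - 1)                 # between-raters mean square
    ems = ss_err / ((n - 1) * (k - 1))      # residual mean square

    single = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    average = (bms - ems) / (bms + (jms - ems) / n)
    return single, average


# Hypothetical 5-point Likert scores: 4 manuscripts rated by 3 reviewers.
toy = [
    [5, 4, 5],
    [2, 3, 2],
    [3, 3, 4],
    [1, 2, 1],
]
single, average = icc2(toy)
print(f"ICC(2,1) = {single:.2f}, ICC(2,k) = {average:.2f}")
```

In the design described above, each row would correspond to one of the 10 manuscripts and each column to one of the seven reviewers, computed separately for each rating domain.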

Results

Human‐authored manuscripts received the highest ratings across all domains (overall quality 4.50 ± 0.76). GPT‐4.0 demonstrated moderate performance (2.71 ± 1.46 overall), while Gemini 2.0 scored lowest (2.14 ± 1.01). Mixed‐effects modeling demonstrated significant group differences across all domains (p ≤ 0.008). Citation quality showed one of the largest between‐group differences and strong reliability [ICC(2,1) = 0.68; ICC(2,7) = 0.94]. Manual verification of 123 AI‐generated references revealed high citation accuracy for GPT‐4.0 (89.6% fully accurate; 0% fabricated) compared with Gemini 2.0 (71.7% fully accurate; 23.9% fabricated). Reviewers misclassified 50% of GPT‐4.0 manuscripts as human‐authored, correctly identified 93% of human‐authored manuscripts, and classified 86% of Gemini manuscripts as AI‐generated.
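The two reliability figures for citation quality can be cross-checked against each other. For the Shrout–Fleiss two-way random-effects model, the average-rater coefficient follows from the single-rater coefficient by the Spearman–Brown relation. The worked step below is a consistency check under that standard relation, with k = 7 reviewers, and is not a calculation reported by the authors.

```latex
\[
\mathrm{ICC}(2,k) = \frac{k\,\mathrm{ICC}(2,1)}{1 + (k-1)\,\mathrm{ICC}(2,1)}
\qquad\Longrightarrow\qquad
\mathrm{ICC}(2,7) = \frac{7 \times 0.68}{1 + 6 \times 0.68}
= \frac{4.76}{5.08} \approx 0.94
\]
```

The result agrees with the reported ICC(2,7) = 0.94: agreement between any two individual reviewers was moderate, while the mean of the seven-member panel was highly reliable.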

Conclusion

GPT‐4.0 generated fluent, stylistically strong reviews but remained significantly inferior to human‐authored manuscripts in analytical depth and citation integrity. Gemini 2.0 underperformed across all domains and demonstrated a substantial rate of fabricated citations. As large language models become integrated into academic workflows, transparent disclosure, structured fact‐checking, and human oversight remain essential to safeguard scientific reliability.
