A comparative analysis of frontier large language model (LLM) performance on standardized knowledge assessments and real-world scenarios in genitourinary (GU) malignancies.
Sameer S. Deshmukh, Aaqid Syed, Oluwatayo Adeoye, Furhan Yunus, Umang Swami, Irbaz Bin Riaz, Winston Tan, Neeraj Agarwal, Guru P. Sonpavde, Arnab Basu
Background:
Large language models (LLMs) are an increasingly common source of medical information for both patients and providers. Clinical decision-making can be complex and may depend on factors outside traditional guidelines. We conducted a comparative performance evaluation of frontier models on standardized genitourinary (GU) oncology knowledge assessments derived from the American Society of Clinical Oncology Self-Evaluation Program 2025 (ASCO SEP 2025), in addition to challenging questions developed by GU oncology experts.
Methods:
Three frontier LLMs were selected: ChatGPT-5.2 (OpenAI, San Francisco, CA, USA), Claude Opus 4.6 (Anthropic, San Francisco, CA, USA), and Gemini 3 Pro (Google DeepMind, Mountain View, CA, USA). Model selection was based on publicly reported performance on “Humanity’s Last Exam,” a benchmark of advanced scientific reasoning (Scale AI leaderboard, January-February 2026). The OpenEvidence (OpenEvidence Inc., MA, USA) search engine was also selected for evaluation. All models were asked 145 ASCO SEP 2025 GU oncology questions and 12 challenging real-world scenarios using a uniform prompt. Subject matter experts peer reviewed the LLM-generated responses to the challenging questions, and credit was given only when both the rationale and the selected option were correct.
Results:
All models demonstrated high and comparable accuracy on the standardized knowledge assessment. Gemini 3 Pro led numerically at 97.24% (95% CI, 93.1% – 99.0%), followed by OpenEvidence at 94.48% (95% CI, 89.5% – 97.3%), ChatGPT-5.2 at 92.41% (95% CI, 86.9% – 95.7%), and Claude Opus 4.6 at 79.31% (95% CI, 71.9% – 85.1%). On challenging real-world scenarios, ChatGPT-5.2 demonstrated the highest numerical accuracy at 100% (95% CI, 75.8% – 100%); Gemini 3 Pro and OpenEvidence each scored 75.0% (95% CI, 46.8% – 91.1%), followed by Claude Opus 4.6 at 66.7% (95% CI, 39.1% – 86.3%).
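The abstract does not state how the 95% confidence intervals were computed. As an illustration only, the sketch below computes a Wilson score interval for a binomial proportion, a common choice for small samples such as the 12 real-world scenarios. The function name `wilson_interval` and the correct-answer counts (for example 9 of 12 for 75.0%) are assumptions inferred from the reported percentages, not values given by the authors.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct, total, confidence=0.95):
    """Return (lower, upper) Wilson score bounds for correct/total.

    Hypothetical helper: the source does not name its interval method.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Counts are assumed from the reported accuracies on 12 scenarios.
    for label, correct in [("12 of 12", 12), ("9 of 12", 9)]:
        low, high = wilson_interval(correct, 12)
        print(f"{label}: {100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumptions, 12 of 12 yields roughly 75.8% to 100% and 9 of 12 yields roughly 46.8% to 91.1%, in line with the intervals reported above. Some of the other reported bounds may differ in the last digit, so the authors may have used a different method or rounding.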
Conclusions:
The frontier AI models evaluated appear to provide accurate medical information and rationale on a large majority of guideline-based medical knowledge assessments for GU oncology. Frontier AI models also appear to have emerging competence in highly challenging real-world scenarios. AI-based tools may become valuable and reliable resources for patients and providers.
Comparative LLM performance.