DOI: 10.4103/ahstj.ahstj_25_25 ISSN: 3117-5422

Cracking the Code: Performance Convergence and Reasoning Blind Spots of Large Language Models on Alzheimer’s Questions in Medical Licensing Examinations

Hany A. Elkattawy, Miral T. Balfas, Shahad M. Alnassar, Amal Ahmad Mohsen, Farah Khalid Rajab

Abstract

Background:

Artificial intelligence (AI), particularly large language models (LLMs), is increasingly used in medical education and examination preparation. Recent studies suggest that frontier LLMs may have reached near-ceiling performance on standardized medical benchmarks; however, less is known about whether such performance converges across models and whether shared reasoning vulnerabilities persist, particularly in clinically complex domains such as Alzheimer’s disease.

Aims and Objectives:

This study aimed to evaluate the performance of three frontier LLMs—ChatGPT-4 (OpenAI), Gemini 2.5 Flash (Google), and Claude Sonnet 4 (Anthropic)—on Alzheimer’s disease-related multiple-choice questions drawn from international medical licensing examinations. A secondary objective was to assess the guideline alignment of model explanations to identify differences in reasoning quality beyond accuracy.

Materials and Methods:

A total of 49 Alzheimer’s disease-focused multiple-choice questions were curated from peer-reviewed preparatory resources and represented five international examinations: USMLE, SMLE, MCCQE, IFOM, and PNA. Each model answered all questions under standardized, independent conditions. Accuracy was recorded, and explanations were evaluated using a guideline alignment score (0–1) based on consistency with authoritative clinical guidelines (NICE, WHO, and AAN). Interrater reliability was assessed using Cohen’s kappa.
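To make the interrater step concrete, the following is a minimal sketch of Cohen's kappa for two raters. It assumes the alignment ratings have been reduced to categorical labels before comparison; the abstract does not describe how the 0–1 scores were categorized, and the function name and example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: ratings are categorical labels (e.g. binned alignment scores).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")

    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four explanations, not data from the study.
example_a = ["high", "high", "low", "low"]
example_b = ["high", "low", "low", "low"]
print(cohens_kappa(example_a, example_b))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over simple percent agreement when label frequencies are unbalanced.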

Results:

All three models achieved identical overall accuracy of 98% (48/49 correct). Each model made the same diagnostic error on a single USMLE vignette, misclassifying Alzheimer’s disease as normal pressure hydrocephalus, revealing a shared reasoning blind spot. Accuracy was 100% in every clinical domain except diagnosis (94.1%), where this single error occurred. Guideline alignment scores differed modestly, with Claude Sonnet 4 demonstrating the highest alignment (0.95), followed by ChatGPT-4 (0.92) and Gemini 2.5 Flash (0.88).
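The reported accuracy and the notion of a shared error can be illustrated with a short sketch. Only the counts 48 and 49 come from the abstract; the question identifier `usmle_vignette` and both function names are hypothetical placeholders used for illustration.

```python
def accuracy_percent(correct, total):
    """Return accuracy as a percentage."""
    return 100.0 * correct / total


def shared_errors(missed_by_model):
    """Return the question IDs that every model answered incorrectly."""
    missed_sets = list(missed_by_model.values())
    return set.intersection(*missed_sets) if missed_sets else set()


# 48 of 49 correct, as reported for each model.
print(round(accuracy_percent(48, 49)))  # 98

# Hypothetical ID standing in for the single USMLE vignette all models missed.
missed = {
    "ChatGPT-4": {"usmle_vignette"},
    "Gemini 2.5 Flash": {"usmle_vignette"},
    "Claude Sonnet 4": {"usmle_vignette"},
}
print(shared_errors(missed))  # {'usmle_vignette'}
```

Intersecting the per-model error sets separates errors common to all models from errors specific to one model, which is the distinction the shared blind spot rests on.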

Conclusion:

Frontier LLMs demonstrate performance convergence on structured Alzheimer’s disease examination questions, achieving identical overall accuracy. However, convergence also extends to a shared diagnostic error, highlighting a systematic reasoning vulnerability. Incorporating guideline alignment reveals differences in explanatory quality despite identical accuracy. These findings underscore the importance of evaluating not only correctness but also reasoning quality and guideline fidelity when integrating LLMs into medical education and clinical decision support.
