Evaluating accuracy and reasoning capabilities of large language models for acute ischemic stroke management


DOI: 10.1136/jnis-2026-025429 ISSN: 1759-8478


Aymen Meddeb, Navid Bakhtiari, Ida Rangus, Leonard Fetscher, Bastien Leguellec, Felix Busch, Alexandre Doucet, Vi Tuan Hua, Fuong Verot-Nguyen, Laurentiu V Paiusan, Pierre F Manceau, Paolo Pagano, Mike Peter Wattjes, Solène Moulin, Laurent Pierot, Sébastien Soize


Background

Timely and accurate treatment decisions in acute ischemic stroke (AIS) are critical, particularly for intravenous thrombolysis (IVT) and mechanical thrombectomy (MT). As many patients initially present to non-specialized centers, decision-making may be delayed or inconsistent. Large language models (LLMs) have the potential to support clinical triage by integrating complex clinical and imaging information. We evaluated the diagnostic accuracy and reasoning characteristics of LLMs for IVT and MT eligibility compared with expert clinicians and real-world decisions.

Methods

In this retrospective study, 80 AIS cases from two stroke centers were converted into structured clinical vignettes, including demographic, clinical, and imaging data. Four LLMs (DeepSeek R1, OpenAI o3 mini, Gemini 2.0, and LLaMA 3.3) and six stroke experts (two neurologists, four neuroradiologists) independently recommended treatment (IVT and/or MT). Ground truth was defined as the institutional treatment decision. Diagnostic accuracy was calculated separately for IVT and MT. A qualitative error analysis assessed reasoning patterns.
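As an illustration of the per-treatment accuracy calculation described above, the following is a minimal sketch and not the authors' code. It assumes each vignette yields one binary IVT decision and one binary MT decision, and that accuracy is the proportion of cases in which a rater's decision matches the institutional decision. The names `Decision` and `accuracy_by_treatment` are hypothetical, and the toy data are invented for demonstration only and are not study data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One treatment recommendation for a single case (hypothetical structure)."""
    ivt: bool  # intravenous thrombolysis recommended
    mt: bool   # mechanical thrombectomy recommended


def accuracy_by_treatment(predicted, reference):
    """Return the fraction of cases matching the reference, separately for IVT and MT."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("predicted and reference must be non-empty and equal in length")
    n = len(reference)
    ivt_correct = sum(p.ivt == r.ivt for p, r in zip(predicted, reference))
    mt_correct = sum(p.mt == r.mt for p, r in zip(predicted, reference))
    return {"IVT": ivt_correct / n, "MT": mt_correct / n}


# Toy data, invented for illustration only (not from the study).
reference = [Decision(True, True), Decision(False, True),
             Decision(True, False), Decision(False, False)]
predicted = [Decision(True, True), Decision(True, True),
             Decision(True, False), Decision(False, False)]

scores = accuracy_by_treatment(predicted, reference)
print(scores)  # {'IVT': 0.75, 'MT': 1.0}
```

Scoring IVT and MT independently, rather than requiring the combined recommendation to match, is what allows a rater to perform well on one treatment and worse on the other.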

Results

DeepSeek R1 achieved the highest MT accuracy among all models and clinicians (87%), with an IVT accuracy of 78%. Overall, accuracy was higher for MT than for IVT across all groups. Neurologists achieved 81% (MT) and 80% (IVT), while neuroradiologists achieved 84% (MT) and 76% (IVT). LLM explanations for MT decisions were largely clinically plausible even when they diverged from real-world choices, whereas IVT errors were predominantly related to incomplete guideline adherence.

Conclusions

LLMs demonstrated expert-level performance in AIS treatment decision-making, particularly for MT, with interpretable reasoning. These findings support further validation of LLM-based decision support systems in acute stroke triage, especially in remote settings.
