DOI: 10.1001/jamanetworkopen.2026.20939 ISSN: 2574-3805

Screening for Missed Opportunities for Diagnosis in the ED Using eTriggers and Large Language Models

Clifford M. Marks, Sean Gibney, Bryan Stenson, Deesha Sarma, Cynthia Gaudet, Haadi Mombini, Thomas A. Buckley, Mario Keko, Larry A. Nathanson, Laura G. Burke, Nathan I. Shapiro, Jonathan L. Burstein, Shamai A. Grossman, Anika Parab, Alexander T. Janke, Arjun K. Manrai, Richard A. Taylor, Carlo L. Rosen, Adam Rodman, Adrian D. Haimovich

Importance

Emergency department (ED) quality review often uses administrative electronic triggers (eTriggers), but their yield for detecting missed opportunities for diagnosis (MODs) is low. Commercial large language models (LLMs) may help screen for MODs, yet evaluation data in real-world cohorts remain limited.

Objective

To evaluate LLMs for identifying MODs in ED eTrigger cohorts.

Design, Setting, and Participants

This retrospective diagnostic study of 2 eTrigger cohorts, ED discharge with return hospital admission within 72 hours and ED admission to the floor with intensive care unit (ICU) escalation within 24 hours, was conducted from April 2015 through March 2025 across 9 EDs (2 academic and 7 community) in 1 US health system. Samples included 200 encounters from the 72-hour return cohort and 100 encounters from the floor-to-ICU cohort; each case was adjudicated by 2 emergency physicians using a review process based on the Safer Dx framework.

Exposures

Cases were evaluated by Claude Sonnet 4, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Pro, GPT-5, and GPT-5 mini.

Main Outcomes and Measures

Main outcomes were sensitivity, specificity, positive predictive value, negative predictive value, area under the receiver operating characteristic curve (AUC), and reviewer-reviewer and reviewer-model concordance.
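As an illustration of how the four threshold-based measures relate to a 2x2 table, the following minimal sketch computes them from true-positive, false-positive, false-negative, and true-negative counts and attaches a Wilson score interval. The abstract does not state which interval method the authors used, so the Wilson interval here is an assumption, and the function names are hypothetical. The example counts (9 of 21 MODs flagged, 141 of 170 non-MODs cleared) are back-calculated from the percentages reported for GPT-5 mini in the 72-hour return cohort; they are not reported directly.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (assumed method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def screening_metrics(tp, fp, fn, tn):
    """Return each measure as (point estimate, (lower, upper))."""
    pairs = {
        "sensitivity": (tp, tp + fn),  # flagged MODs / all MODs
        "specificity": (tn, tn + fp),  # cleared non-MODs / all non-MODs
        "ppv": (tp, tp + fp),          # true MODs / all flagged cases
        "npv": (tn, tn + fn),          # true non-MODs / all cleared cases
    }
    return {
        name: (k / n, wilson_interval(k, n))
        for name, (k, n) in pairs.items()
    }


# Back-calculated counts (an assumption, see the text above).
metrics = screening_metrics(tp=9, fp=29, fn=12, tn=141)
for name, (estimate, (low, high)) in metrics.items():
    print(f"{name}: {estimate:.1%} (95% CI, {low:.1%}-{high:.1%})")
```

With these counts, the sensitivity and specificity intervals agree with the ones reported in the Results to 1 decimal place, which is consistent with, but does not confirm, a Wilson-type interval.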

Results

Among 300 sampled encounters, 12 were excluded, leaving 288 analyzed encounters (median [IQR] age, 69 [54-79] years; 135 female [46.9%]) with 39 MODs (13.5%), including 21 of 191 (11.0%) in the 72-hour return cohort and 18 of 97 (18.6%) in the floor-to-ICU cohort. Interrater agreement was 81.9% (95% CI, 77.4%-86.1%), with Gwet AC1 of 0.77 (95% CI, 0.70-0.83). In the 72-hour return cohort, model sensitivity ranged from 42.9% (95% CI, 24.5%-63.5%) for GPT-5 mini to 85.7% (95% CI, 65.4%-95.0%) for Claude Sonnet 4, specificity from 55.9% (95% CI, 48.4%-63.1%) for Claude Sonnet 4 to 82.9% (95% CI, 76.6%-87.9%) for GPT-5 mini, and AUC from 0.65 (95% CI, 0.53-0.77) for GPT-5 mini to 0.73 (95% CI, 0.61-0.85) for Claude Sonnet 4. In the floor-to-ICU cohort, sensitivity ranged from 5.6% (95% CI, 1.0%-25.8%) for GPT-5 mini to 55.6% (95% CI, 33.7%-75.4%) for Claude Sonnet 4, specificity from 64.6% (95% CI, 53.6%-74.2%) for Claude Sonnet 4 to 97.5% (95% CI, 91.2%-99.3%) for GPT-5 mini, and AUC from 0.57 (95% CI, 0.46-0.67) for GPT-5 mini to 0.82 (95% CI, 0.73-0.91) for GPT-5. Across cohorts, LLMs showed similar discrimination but different sensitivity-specificity tradeoffs; Claude Sonnet 4 generally favored higher sensitivity, whereas GPT-5 mini favored higher specificity.
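To show how percent agreement and Gwet AC1 differ, the following minimal sketch computes both for 2 raters making a binary MOD/no-MOD call. The abstract does not specify the rating scale used in adjudication, so the binary form is an assumption, and the function name and the 10 example ratings are hypothetical rather than study data.

```python
def gwet_ac1_binary(rater_a, rater_b):
    """Return (percent agreement, Gwet AC1) for two raters, 0/1 ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of cases where both raters gave the same call.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Mean proportion of positive calls across the two raters.
    pi = (sum(rater_a) + sum(rater_b)) / (2 * n)
    # AC1 chance agreement for 2 categories: 2 * pi * (1 - pi), at most 0.5.
    chance = 2 * pi * (1 - pi)
    return observed, (observed - chance) / (1 - chance)


# Hypothetical ratings: 1 = MOD, 0 = no MOD.
reviewer_1 = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
reviewer_2 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
agreement, ac1 = gwet_ac1_binary(reviewer_1, reviewer_2)
print(f"agreement={agreement:.1%}, AC1={ac1:.2f}")
```

Because the AC1 chance-agreement term shrinks when one category is rare, AC1 tends to stay closer to raw agreement than Cohen kappa does in low-prevalence samples such as these.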

Conclusions and Relevance

In this diagnostic study of 2 ED eTrigger cohorts, model performance varied by cohort, with LLMs showing similar discrimination but different binary thresholds. These findings suggest that evaluation within the review workflow is needed before implementation and that reviewer-like concordance captures a dimension of model behavior distinct from discrimination.
