DOI: 10.1093/ajrccm/aamag286.293 ISSN: 1073-449X

D28-11 Zero-Shot Large Language Models for Detection and Characterization of Acute Pulmonary Embolism From Unstructured Radiology Reports

S Hassan, N Dao, J Webb, A Ashok, G Koybasi, A Gogolashvili, G Piazza, J Loscalzo, A R El Boueiz, M H Cho, M J Cuttica, R Mylvaganam, B Bikdeli, A B Waxman, F N Rahaghi, G R Washko

Abstract

Rationale

Automated detection of pulmonary embolism (PE) from radiology reports is essential for clinical registries, research cohorts, and quality improvement initiatives. Traditional NLP approaches require extensive labeled training data and demonstrate limited generalizability across imaging protocols. Large language models (LLMs) offer zero-shot classification without task-specific training, but rigorous validation across diverse clinical contexts and PE subtypes remains lacking.

Methods

We evaluated 10 models for PE detection from CT radiology reports at a large academic medical center (2015-2024). Models included traditional approaches (SVM, regular expressions), fine-tuned BERT models (ClinicalBERT, BioBERT, Clinical-Longformer), and five zero-shot LLMs (gemma2:9b, qwen2.5:7b, mistral:7b, llama3.1:8b, llama3.2:3b) using prompts fixed a priori. From 234,000 reports, 1,463 were selected for physician review across six CT modalities. Five physicians established ground truth for PE presence and four descriptors: anatomic location, acuity, laterality, and clot burden. Inter-rater reliability was assessed on 100 reports reviewed by all physicians. Performance metrics included F1 score, sensitivity, specificity, and accuracy, each with bootstrap confidence intervals.
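The metrics named above can be illustrated with a minimal sketch. It assumes binary labels (1 = PE present, 0 = PE absent) and a percentile bootstrap over reports. The abstract does not state which bootstrap variant or how many resamples were used, so the function names, `n_boot`, and `seed` below are illustrative assumptions rather than the authors' procedure.

```python
import random

def metrics(y_true, y_pred):
    """Return sensitivity, specificity, accuracy, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / len(y_true),
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }

def bootstrap_ci(y_true, y_pred, metric="f1", n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample reports with replacement (assumed variant)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_true = [y_true[i] for i in idx]
        resampled_pred = [y_pred[i] for i in idx]
        stats.append(metrics(resampled_true, resampled_pred)[metric])
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Calling `bootstrap_ci(y_true, y_pred, metric="f1")` returns a (lower, upper) pair for the chosen metric; the same call with `metric="sensitivity"` or `metric="specificity"` covers the other measures.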

Results

The cohort included 491 PE-positive (33.6%) and 972 PE-negative reports. Inter-rater agreement was excellent (mean kappa 0.902). Zero-shot LLMs achieved the strongest performance: gemma2:9b attained 96.3% accuracy, 95.3% sensitivity, 96.8% specificity, and F1 0.945 (95% CI: 0.930-0.958). Four LLMs achieved F1 scores exceeding 0.93 without any training. Fine-tuned ClinicalBERT achieved 93.9% accuracy and F1 0.91. The best zero-shot LLM produced 44% fewer false positives than the best fine-tuned BERT model. Our previously published traditional SVM and regex models performed worse (F1 0.70-0.71) and critically failed to generalize beyond their native CT PE Protocol training data, with F1 scores dropping by 40% when applied to other CT modalities. In contrast, zero-shot LLMs maintained consistent accuracy across all six CT types (variance <0.0004). Contrary to historical NLP challenges, subsegmental PE detection (>93% sensitivity) was not inferior to central PE detection. Chronic PE showed lower detection sensitivity (70-84%) than acute PE (95-100%). LLM confidence levels correlated strongly with accuracy (r = 0.95-0.98). For descriptor classification, LLMs achieved 85-94% accuracy for laterality and 59-90% for acuity. All models were deployed locally on a standard workstation, supporting HIPAA-compliant implementation.
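As a consistency check on the gemma2:9b figures, the confusion-matrix counts can be approximately reconstructed from the reported cohort sizes, sensitivity, and specificity. The counts below are derived from rounded percentages, so they are an approximation and not values reported in the abstract.

```python
# Reported in the abstract: cohort sizes and gemma2:9b sensitivity/specificity.
n_pos, n_neg = 491, 972
sensitivity, specificity = 0.953, 0.968

# Approximate counts, reconstructed by rounding (an assumption, not reported data).
tp = round(sensitivity * n_pos)   # true positives
fn = n_pos - tp                   # false negatives
tn = round(specificity * n_neg)   # true negatives
fp = n_neg - tn                   # false positives

accuracy = (tp + tn) / (n_pos + n_neg)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

Under this reconstruction, the implied accuracy and F1 round to the reported 96.3% and 0.945, which suggests the headline metrics are mutually consistent.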

Conclusions

Zero-shot LLMs match or exceed fine-tuned BERT models for PE detection without requiring task-specific training, while demonstrating superior generalization across diverse CT imaging protocols. These models additionally extract clinically relevant PE descriptors with variable accuracy. Zero-shot LLMs represent a paradigm shift enabling rapid deployment across institutions without local training data requirements.

This abstract is funded by: 1R01HL164717-01
