DOI: 10.1111/echo.70525 ISSN: 0742-2822

Application of LLMs in CAD‐RADS Classification and Patient Management

Piotr Tarkowski, Giuseppe Muscogiuri, Davide Casartelli, Francesca Coraducci, Fatma Sassi, Jessica Usai, Grzegorz Staśkiewicz, Elżbieta Siek, Jakub Byczkowski, Răzvan‐Andrei Licu, Marianna Mirchuk, Marco Guglielmo, Sandro Sironi

ABSTRACT

Purpose

To evaluate the capability of four publicly available large language models (LLMs) to assign Coronary Artery Disease‐Reporting and Data System (CAD‐RADS) scores and provide patient management recommendations based on synthetic coronary CT angiography (CCTA) reports.

Methods

Four LLMs (ChatGPT 4o, Claude 3.7, DeepSeek, and Gemini 2.5 Pro) were tasked with analyzing reports and suggesting next steps. Prompts were framed from the perspective of both a cardiologist and a radiologist. Agreement with a human reference standard was assessed using weighted Cohen's kappa, Fleiss' kappa, and Krippendorff's alpha for CAD‐RADS scoring, and unweighted Cohen's kappa for management recommendations. A Bayesian Wilcoxon signed‐rank test was performed to assess directional bias.
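
As an illustration of the agreement statistic named above, the following is a minimal sketch of weighted Cohen's kappa for ordinal CAD‐RADS labels. It is not the authors' analysis code. The abstract does not state the weighting scheme, so linear weights are assumed by default; the function name `weighted_kappa`, the category ordering, the exclusion of non‐diagnostic (N) studies, and the example ratings are all hypothetical and do not come from the study.

```python
from collections import Counter

# Assumed ordinal ordering of CAD-RADS categories (N / non-diagnostic is
# left out of this sketch because it has no natural rank).
CAD_RADS = ["0", "1", "2", "3", "4A", "4B", "5"]


def weighted_kappa(reference, predicted, categories, weighting="linear"):
    """Weighted Cohen's kappa between two raters on ordinal categories.

    The weighting scheme is an assumption: "linear" or "quadratic".
    """
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")
    if len(reference) != len(predicted) or not reference:
        raise ValueError("ratings must be non-empty and of equal length")

    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    n = len(reference)
    power = 1 if weighting == "linear" else 2

    # Joint counts of (reference, predicted) pairs and each rater's marginals.
    observed = Counter((index[r], index[p]) for r, p in zip(reference, predicted))
    ref_marginal = Counter(index[r] for r in reference)
    pred_marginal = Counter(index[p] for p in predicted)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the distance between categories.
            weight = (abs(i - j) / (k - 1)) ** power
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += (
                weight * ref_marginal[i] * pred_marginal[j] / (n * n)
            )

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical ratings for illustration only (not study data).
human = ["0", "1", "2", "3", "4A", "4B", "5", "3"]
model = ["0", "1", "2", "3", "3", "4A", "5", "2"]
kappa = weighted_kappa(human, model, CAD_RADS)
```

With ordinal weights, a model that assigns 4A where the reference says 4B is penalized less than one that assigns 2, which is why a weighted statistic suits CAD‐RADS scoring while an unweighted kappa suits the nominal management recommendations.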

Results

Performance varied across LLMs and prompt identities. Claude 3.7 achieved almost perfect agreement for CAD‐RADS scoring (κ = 0.997) regardless of prompt identity. Gemini similarly achieved almost perfect agreement (radiologist: κ = 0.962; cardiologist: κ = 0.990). ChatGPT demonstrated almost perfect agreement when prompted as a radiologist (κ = 0.896) but only substantial agreement when prompted as a cardiologist (κ = 0.715). DeepSeek showed the lowest overall performance (radiologist: κ = 0.637; cardiologist: κ = 0.768). By category, all LLMs correctly identified CAD‐RADS 0, whereas higher‐grade stenosis (4A/4B) remained the most challenging, with non‐Claude models showing low‐to‐null agreement in some configurations. The LLMs' accuracy in proposing further management was considerably lower than their scoring accuracy, with CAD‐RADS 3 showing the greatest variability in management recommendations across models and between human specialists. Furthermore, both CAD‐RADS scoring and management recommendations varied depending on the professional identity specified in the prompt.

Conclusion

While LLMs demonstrated reliable scoring performance for lower‐grade CAD‐RADS categories (0‐2), agreement was substantially reduced for higher‐grade stenosis categories (4A/4B) and non‐diagnostic studies, which could pose risks to patients. Their current ability to generate dependable clinical management recommendations is limited.
