Large Language Model-Generated Differential Diagnoses in Radiology Education: Comparison with a Standard Casebook

doi:10.3390/diagnostics16132009

ISSN: 2075-4418

Pauline Chapellier, Jacopo Ferrari, Thomas Saliba, Patrick Jeltsch, Mustafa Mohamed, Sofyan Jankovski, Gorun Ilanjian, Marta Epis, Virginia Pansini, Federica Bragaglia, Alessandro Agostinelli, Krismalyn Caringal, Lachezar Lalov, David C. Rotzinger, Guillaume Fahrni

Background/Objectives: Large language models (LLMs) are increasingly explored for radiology education, but their role in differential diagnosis learning remains under-investigated. This study evaluates the perceived usefulness of LLM-generated differential diagnoses compared with a standard radiology casebook. Methods: In this multi-center study, radiology trainees at junior (years 1–2) and advanced (years 3–5) levels evaluated 225 cases from a gold-standard casebook spanning nine subspecialties. Participants ranked the usefulness of their personal clinical experience, the casebook, and LLM teaching, and rated the LLM output using a five-point Likert scale across Clarity, Trust, Differential Usefulness, and Diagnostic Usefulness. Results: Thirteen trainees (4 junior, 9 advanced) completed 2425 evaluations. Overall, the casebook was rated most useful (mean rank 1.7 ± 0.2), followed by LLM teaching (1.8 ± 0.3) and personal experience (2.4 ± 0.2; p = 0.023), with no significant difference between LLM teaching and the casebook (p = 0.438). Junior trainees favored LLM teaching more than advanced trainees (first-rank 66.6% vs. 22.1%; p = 0.037). Across subspecialties, the casebook consistently ranked highest, with LLM teaching slightly lower and personal experience lowest. LLM teaching received high ratings for Clarity (4.4 ± 0.3), Trust (4.3 ± 0.3), Differential Usefulness (4.3 ± 0.4), and Diagnostic Usefulness (4.2 ± 0.4), with no statistically significant difference between domains (p = 0.149). Conclusions: LLM-generated differential diagnoses were perceived as clear, trustworthy, and highly useful for education, nearing the perceived value of a standard casebook, especially among junior trainees. While textbooks remain essential, LLMs hold promise as supplementary tools, but caution is needed due to potential inaccuracies and their inability to replicate image-based teaching.
