Benchmarking of AI and Radiologists for Indeterminate Lung Nodule Malignancy Risk Estimation on Screening CT: The LUNA25 Challenge

DOI: 10.1148/ryai.260179 ISSN: 2638-6100

Dré Peeters, Bogdan Obreja, Noa Antonissen, Zaigham Saghir, Ugo Pastorino, Mario Silva, Geertruida H. de Bock, Hester Gietema, Fergus Gleeson, Marjolein A. Heuvelmans, Stephen Lam, Geert Litjens, Firdaus Mohamed Hoesein, Cornelia Schaefer-Prokop, Ernst Scholten, Annemiek Snoeckx, Erik H. F. M. van der Heijden, Rozemarijn Vliegenthart, Mathias Prokop, Colin Jacobs

Purpose: To compare the performance of an artificial intelligence (AI) system with that of radiologists for estimating malignancy risk of indeterminate-size nodules (5–15 mm) on low-dose CT (LDCT) within a standardized and transparent evaluation framework.

Materials and Methods: Teams participating in the AI study had access to a public dataset of 555 malignant and 5608 benign nodules on 4069 baseline LDCT scans from the National Lung Screening Trial (NLST) to develop AI systems. External testing was performed on 156 malignant and 312 benign size-matched nodules, all of indeterminate size, from 463 baseline scans collected from three large European lung cancer screening trials between 2004 and 2018, and the best-performing AI system (based on area under the receiver operating characteristic curve [AUC]) was selected. An observer study was conducted in which radiologists assessed 300 randomly selected nodules (100 malignant, 200 benign) from the external test set. Radiologists categorized nodules as low, intermediate, or high risk, and the ≥ intermediate-risk threshold (intermediate or high risk) was used to define a positive test. The selected AI system was compared with radiologists on this subset using the AUC.

Results: The selected AI system demonstrated superior performance to the 65 radiologists' average (AUC, 0.78 [95% CI: 0.73, 0.84] vs 0.70 [95% CI: 0.65, 0.74]; P = .001). When using the ≥ intermediate-risk threshold, the AI system correctly classified 12% more malignant nodules at matched specificity and yielded 20% fewer false positives at matched sensitivity.

Conclusion: The selected AI system was superior to radiologists in estimating malignancy risk of indeterminate lung nodules on LDCT.

© RSNA, 2026
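The comparison described in the abstract rests on two quantities: the AUC of each rater, and the sensitivity of the AI system at the specificity that readers reach with the ≥ intermediate-risk threshold. A minimal Python sketch of both is given below. It is not the challenge's evaluation code, which the abstract does not show. The function names and the twelve-nodule toy data are hypothetical and chosen only for illustration, and the sketch omits the confidence intervals and significance test reported in the Results.

```python
from typing import Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a malignant nodule outscores a benign one.

    Ties count as half a win, which matters for ordinal reader categories.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(labels: Sequence[int], positive: Sequence[bool]) -> Tuple[float, float]:
    """Sensitivity and specificity of a binary call against ground truth."""
    tp = sum(1 for y, p in zip(labels, positive) if y == 1 and p)
    tn = sum(1 for y, p in zip(labels, positive) if y == 0 and not p)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


def sensitivity_at_specificity(
    labels: Sequence[int], scores: Sequence[float], target_spec: float
) -> float:
    """Best sensitivity over all score thresholds whose specificity is >= target_spec."""
    best = 0.0
    for t in sorted(set(scores)):
        sens, spec = sens_spec(labels, [s >= t for s in scores])
        if spec >= target_spec:
            best = max(best, sens)
    return best


# Hypothetical toy data (NOT study data): 4 malignant (1) and 8 benign (0) nodules.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Reader categories: 0 = low, 1 = intermediate, 2 = high risk.
reader_category = [2, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
# Continuous AI malignancy scores for the same nodules.
ai_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05]

# A positive reader test is intermediate or high risk (category >= 1).
reader_sens, reader_spec = sens_spec(labels, [c >= 1 for c in reader_category])
ai_sens_matched = sensitivity_at_specificity(labels, ai_score, reader_spec)

print("reader AUC:", auc(labels, reader_category))
print("AI AUC:", auc(labels, ai_score))
print("reader sensitivity / specificity:", reader_sens, reader_spec)
print("AI sensitivity at matched specificity:", ai_sens_matched)
```

The pairwise form of the AUC is used here because it handles the three-level reader ratings without any special casing. Matched sensitivity can be obtained the same way by swapping the roles of sensitivity and specificity in `sensitivity_at_specificity`.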
