WiNGPT-32B: An Open-Source, Locally Deployable LLM for RECIST Assessment via Chained Task Execution Using Radiology Report Text

doi:10.3390/diagnostics16132020

ISSN: 2075-4418

Lingyun Wang, Lu Zhang, Yaping Zhang, Lin Zhang, Xueqian Xie

Objective: This study aimed to develop a large language model (LLM) that performs Response Evaluation Criteria in Solid Tumors (RECIST) assessment using only longitudinal radiology report text.
Methods: This study included 258 patients with solid tumors, encompassing 2065 longitudinal CT/MRI examination time points. We developed WiNGPT-32B, an open-source and locally deployable LLM, by infusing it with domain-specific medical knowledge and optimizing it via knowledge distillation, using GPT-4 as the teacher model. Central to its architecture is the Chained Task Execution (CTE) framework, which structures RECIST assessment into four modular components: lesion diameter extraction, sum of longest diameter computation, tumor response classification, and report generation. Model performance (accuracy, recall, precision, and F1 score) was benchmarked against GPT-4 and a single radiologist, with the consensus of three independent radiologists as the reference standard.
Results: Of the 258 patients, 212 (82.2%) had 4–10 imaging time points, 36 (13.9%) had 11–20, and 10 (3.9%) had more than 20. For target lesions, the successful extraction rate of WiNGPT-32B was 0.934 (95% CI: 0.922–0.944), numerically but not significantly higher than GPT-4’s 0.920 (0.907–0.931; p = 0.083). In five-category RECIST classification (complete response, partial response, stable disease, progressive disease, and not evaluable), WiNGPT-32B achieved an overall accuracy of 0.805 (0.786–0.823), significantly higher than GPT-4 (0.699, 0.678–0.720; p < 0.001) but lower than the radiologist (0.915, 0.901–0.928; p < 0.001). For progressive disease, WiNGPT-32B had an F1 score of 0.841 (0.813–0.870), significantly outperforming GPT-4’s 0.755 (0.720–0.790) and approaching the radiologist’s 0.922 (0.902–0.942).
Conclusions: WiNGPT-32B demonstrates the feasibility of a text-only, open-source LLM with the CTE framework for longitudinal RECIST assessment, with promising performance in detecting disease progression.
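
To make the middle two CTE steps concrete (sum of longest diameter computation and tumor response classification), the following is a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the thresholds are the commonly cited RECIST 1.1 target-lesion rules (at least a 30% decrease from baseline for partial response; at least a 20% increase from the nadir plus an absolute increase of at least 5 mm for progressive disease). It assumes lesion diameters in millimetres have already been extracted from each report, and it ignores non-target lesions, new lesions, and the lymph-node short-axis rule.

```python
from typing import List, Optional


def sum_of_longest_diameters(diameters_mm: List[float]) -> float:
    """Sum the target-lesion diameters (mm) reported at one time point."""
    return float(sum(diameters_mm))


def classify_target_response(
    baseline_sld: float, nadir_sld: float, current_sld: Optional[float]
) -> str:
    """Simplified target-lesion response (assumed RECIST 1.1 thresholds)."""
    if current_sld is None:
        return "NE"  # not evaluable: no usable measurements at this time point
    if current_sld == 0:
        return "CR"  # simplification: all target lesions have disappeared
    increase = current_sld - nadir_sld
    # PD: >= 20% increase over the nadir AND >= 5 mm absolute increase.
    relative_ok = nadir_sld == 0 or increase / nadir_sld >= 0.20
    if relative_ok and increase >= 5.0:
        return "PD"
    # PR: >= 30% decrease relative to baseline.
    if baseline_sld > 0 and (baseline_sld - current_sld) / baseline_sld >= 0.30:
        return "PR"
    return "SD"


def assess_series(timepoints: List[Optional[List[float]]]) -> List[str]:
    """Chain both steps over a series; the first entry is the baseline scan."""
    baseline = sum_of_longest_diameters(timepoints[0])
    nadir = baseline  # smallest sum seen so far, baseline included
    responses = []
    for diameters in timepoints[1:]:
        if diameters is None:
            responses.append("NE")
            continue
        current = sum_of_longest_diameters(diameters)
        responses.append(classify_target_response(baseline, nadir, current))
        nadir = min(nadir, current)
    return responses
```

For example, `assess_series([[30.0, 20.0], [20.0, 10.0], None, [25.0, 15.0]])` walks a baseline sum of 50 mm through a 30 mm follow-up, a time point without measurements, and a 40 mm follow-up. Keeping the arithmetic in deterministic code like this, separate from the text-extraction step, is one plausible reason to structure the task as a chain.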
