DOI: 10.1200/jco.2026.44.19_suppl.22 ISSN: 0732-183X

Semantic interpretation of prior therapy eligibility in oncology protocols: A comparative analysis of proprietary and open-source large language models (LLMs).

Minh Tran, Frank Po-Yen Lin

22

Background: Efficient patient accrual in oncology trials is often impeded by the manual burden of interpreting complex eligibility criteria. A critical rate-limiting step in prescreening is the accurate classification of “prior lines of therapy”, a task frequently complicated by substantial semantic heterogeneity and linguistic ambiguity within protocol texts. We evaluated the reasoning performance of state-of-the-art LLMs in automating the extraction and binary classification of these criteria. Methods: To assess LLM reasoning regarding prior therapy requirements, we curated a dataset of eligibility clauses from ClinicalTrials.gov (2020–2023). Statements were extracted and annotated by a medical oncologist with one of two binary labels for a specific drug class, “Must Have Received” (MR) or “Must Not Have Received” (MNR), establishing the ground truth. We compared four LLMs: GPT-4o, GPT-5, Ministral-14b, and Qwen3-32b. This natural language processing task used zero-shot and few-shot section-aware prompting (differentiating inclusion vs exclusion criteria) with a temperature of 0.0 to ensure deterministic output. Performance was quantified via accuracy, F1 score, Cohen’s kappa (κ), and Matthews correlation coefficient (MCC). Operational viability was assessed via the output parse success rate, defined as the capacity to generate strict, machine-readable binary labels without schema-breaking hallucinations. Results: The final corpus comprised 1,360 human-annotated prior therapy criteria statements (52% inclusion, 48% exclusion; 44.4% MR, 55.6% MNR) encompassing 70 distinct drug classes. GPT-5 demonstrated superior semantic reasoning capabilities, achieving the highest classification accuracy (92.2%) and F1 score (0.921), with near-perfect inter-rater agreement against the human gold standard (κ=0.843; MCC=0.844). However, GPT-5 exhibited a 19.5% parse failure rate (80.5% parse success), indicating challenges with strict output normalisation. The open-source model Qwen3-32b demonstrated exceptional operational stability with a 98.9% parse success rate while maintaining robust accuracy (88.9%). Ministral-14b yielded inferior performance (accuracy 77.0%, F1=0.772), driven primarily by a high false-negative prediction rate. While GPT-5 offers peak reasoning performance, Qwen3-32b provides the operational stability requisite for high-throughput automated pipelines. Conclusions: Selected LLMs demonstrate near-expert accuracy in resolving the semantic ambiguity of prior therapy criteria within oncology protocols. This suggests that integrating LLM-driven prescreening into clinical trial workflows is a viable strategy to optimise participant matching and accelerate drug development.
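The evaluation metrics named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' pipeline, and several details are assumptions the abstract does not specify: the label strings `MR` and `MNR` as the expected output schema, MR as the positive class, a single-class (not macro-averaged) F1, and the scoring of classification metrics on successfully parsed outputs only. The function names `parse_label`, `binary_metrics`, and `evaluate` are hypothetical.

```python
import math
from typing import Dict, List, Optional

# Assumed output schema: the abstract does not state the exact label strings.
LABELS = {"MR", "MNR"}


def parse_label(raw: str) -> Optional[str]:
    """Strict parse: accept the output only if it is exactly one label."""
    token = raw.strip().upper()
    return token if token in LABELS else None


def binary_metrics(gold: List[str], pred: List[str],
                   positive: str = "MR") -> Dict[str, float]:
    """Accuracy, F1, Cohen's kappa and MCC from a 2x2 confusion matrix."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    n = tp + tn + fp + fn
    if n == 0:
        raise ValueError("no parsed predictions to score")

    accuracy = (tp + tn) / n
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    return {"accuracy": accuracy, "f1": f1, "kappa": kappa, "mcc": mcc}


def evaluate(gold: List[str], raw_outputs: List[str]) -> Dict[str, float]:
    """Parse success rate over all outputs; other metrics over parsed ones
    (an assumption: the abstract does not say how failures were scored)."""
    parsed = [parse_label(r) for r in raw_outputs]
    kept = [(g, p) for g, p in zip(gold, parsed) if p is not None]
    metrics = binary_metrics([g for g, _ in kept], [p for _, p in kept])
    metrics["parse_success_rate"] = len(kept) / len(raw_outputs)
    return metrics


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from the study corpus.
    gold = ["MR", "MNR", "MR", "MNR", "MR"]
    raw = ["MR", "mnr ", "MNR", "The answer is MR", "MR"]
    print(evaluate(gold, raw))
```

In this sketch a verbose reply such as "The answer is MR" counts as a parse failure even though it contains a correct label, which is one way a model could combine high classification accuracy with a low parse success rate.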
