DOI: 10.1200/jco.2026.44.19_suppl.15 ISSN: 0732-183X

Automated cancer data extraction using large language models: A scalable workflow for clinical documentation processing.

See Boon Tay, Mie Mie Aung, Brenda Tay, Jasmine Hui Wai Ling, Aaron Chuah, Han Jieh Tey, Joanne Wei Li Tan, Evelyn Yi Ting Wong, Ryan Shea Ying Cong Tan, Fuh-Yong Wong, Iain Tan, Wei Chong Tan


Background: Accurate and timely capture of key diagnosis variables, including ICD-10-AM codes, histology, laterality and diagnosis date, is essential for cancer registry reporting, research and service planning. Current curation workflows rely on manual review of clinical and histopathology documents for the approximately 1000 cancer patients diagnosed monthly at National Cancer Centre Singapore, resulting in substantial workload and data latency of up to six months. Scalable automation is therefore critically needed.
Methods: We developed and evaluated an automated extraction pipeline using large language models (LLMs) to generate structured cancer diagnosis data from unstructured clinical text. A structured prompt engineering framework incorporating multi-stage task decomposition, reference-grounded generation, and few-shot learning was designed to enforce strict medical coding constraints, anatomical consistency, and source hierarchy. An institutional in-house model (GPT-5) processed retrieved clinical documents to extract ICD-10-AM codes, histology, laterality and diagnosis dates. Extraction results were independently adjudicated by a trained cancer informatician and an oncologist. Based on initial adjudication, predefined criteria were developed to identify complex/ambiguous cases for manual review to enhance accuracy. Model performance was evaluated using accuracy, recall, precision and F1 score.
Results: Two cohorts were analysed: Cohort A (360 patients, 444 cancer diagnoses across 30 selected cancers), diagnosed between January 2018 and December 2024, and Cohort B (359 patients, 415 diagnoses in an unselected cohort), diagnosed between January and April 2025. The pipeline processed each cohort within 136-144 minutes; 23.3% (84/360) and 23.1% (83/359) of patients in Cohorts A and B, respectively, were flagged for human-in-the-loop review based on predefined complex/ambiguous case criteria. Model performance for cases processed end-to-end without manual intervention is presented in Table 1.
Conclusions: Our LLM-based extraction pipeline accurately captured cancer diagnosis information from clinical documentation, reducing manual workload and shortening data latency. Cohort B showed higher overall accuracy than Cohort A (0.915 vs 0.840), which we attribute to lower document volume and reduced clinical documentation complexity, both of which improved signal clarity and extraction accuracy. Embedded quality control criteria with human-in-the-loop review effectively identified complex or ambiguous cases for expert review, supporting scalable cancer registry operations with strong potential for broader healthcare implementation.
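The review rates quoted in the Results can be reproduced from the patient counts alone. The following is a minimal sketch of that arithmetic; the function and variable names are illustrative and are not part of the authors' pipeline, and the abstract does not state which denominators underlie Table 1.

```python
# Patient counts and flagged counts copied from the Results paragraph.
cohorts = {
    "A": {"patients": 360, "flagged": 84},
    "B": {"patients": 359, "flagged": 83},
}


def review_summary(patients: int, flagged: int) -> dict:
    """Return the percentage flagged for manual review and the count not flagged."""
    return {
        "flag_rate_pct": round(100 * flagged / patients, 1),
        "unflagged": patients - flagged,
    }


# Hypothetical helper output, keyed by cohort label.
summaries = {name: review_summary(**counts) for name, counts in cohorts.items()}

for name, s in summaries.items():
    print(f"Cohort {name}: {s['flag_rate_pct']}% flagged, "
          f"{s['unflagged']} patients not flagged")
```

Under these figures, roughly three quarters of the patients in each cohort were not flagged for human-in-the-loop review.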

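Table 1. Model performance.

| Cohort | Overall accuracy | ICD-10-AM accuracy | ICD-10-AM recall | ICD-10-AM precision | ICD-10-AM F1 score | Histology accuracy | Laterality accuracy | Diagnosis date accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cohort A | 0.840 | 0.965 | 0.965 | 0.976 | 0.971 | 0.956 | 0.970 | 0.955 |
| Cohort B | 0.915 | 0.996 | 0.996 | 1.000 | 0.998 | 0.927 | 0.973 | 0.990 |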
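
The F1 score is the harmonic mean of precision and recall, so the ICD-10-AM F1 values in the table can be sanity-checked against the reported precision and recall. The sketch below does that; it assumes the reported figures are rounded to three decimals, so agreement is checked within a 0.001 tolerance. The names `f1_score`, `reported` and `TOLERANCE` are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# ICD-10-AM precision, recall and F1 as reported in the model performance table.
reported = {
    "Cohort A": {"precision": 0.976, "recall": 0.965, "f1": 0.971},
    "Cohort B": {"precision": 1.000, "recall": 0.996, "f1": 0.998},
}

# Assumed tolerance: inputs are themselves rounded to three decimals.
TOLERANCE = 0.001

recomputed = {
    name: f1_score(row["precision"], row["recall"])
    for name, row in reported.items()
}

for name, row in reported.items():
    diff = abs(recomputed[name] - row["f1"])
    status = "consistent" if diff < TOLERANCE else "check"
    print(f"{name}: reported F1 {row['f1']:.3f}, "
          f"recomputed {recomputed[name]:.4f} ({status})")
```

Recomputing from rounded inputs cannot recover the exact published value, which is why the comparison uses a tolerance instead of strict equality.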
