Disease-specific safety risks in oncology large language models: A multi-axis evaluation across solid tumor subtypes.
Yan Leyfman, Connor Yost, Helena S. Coloma, Muskan Joshi, Taha Kassim Kassim Dohadwala, Soumiya Nadar, Harashita Vallabhaneni, Diksha Sanjana Pasnoor, Chandler H. Park, Arturo Loaiza-Bonilla
Background: Large language models (LLMs) are increasingly explored for oncology clinical decision support; however, safety risks related to hallucinations and guideline misalignment remain poorly characterized in solid tumors, where management is highly dependent on biomarker status, disease state, and multimodality sequencing. Aggregate performance metrics may obscure disease-specific vulnerabilities that are critical for safe clinical deployment.
Methods: We curated 186 solid-tumor tumor-board vignettes spanning five disease domains: breast cancer (n=50), gastrointestinal cancers (n=50), genitourinary cancers (n=30), CNS metastases (n=50), and gynecologic oncology (n=50). Each vignette was evaluated using three configurations: (1) an unconstrained LLM, (2) an NCCN-anchored retrieval-augmented generation (RAG) system, and (3) a literature-anchored RAG system. Two board-certified oncologists independently scored outputs using a modified Generative Performance Score (mGPS; −1 to +1), incorporating guideline concordance (0.00–1.00) and hallucination penalties (0.00 to −1.00). Overall safety disparity was conservatively assigned as the maximum severity across axes and classified as low, intermediate, high, or severe. Readability and rationality were scored separately (Likert 1–5).
Results: Safety-aligned performance varied markedly by solid tumor subtype. Breast cancer outputs were predominantly low-to-intermediate risk (88%), with high-disparity cases driven primarily by biomarker-dependent guideline misalignment. GI cancers demonstrated increased vulnerability (32% high disparity), reflecting multidisciplinary complexity and biomarker omission. CNS metastases (80%) and gynecologic oncology (70%) exhibited the highest proportions of high-disparity outputs, frequently driven by combined hallucination and guideline failures in multimodality or rare-disease contexts. Across all subtypes, NCCN-anchored RAG improved mean mGPS and reduced hallucination penalties compared with unconstrained and literature-anchored systems but did not eliminate high-risk failures. Readability remained moderate to high across systems and showed poor correlation with safety, frequently masking clinically unsafe recommendations.
Conclusions: In solid tumors, LLM safety is highly disease-dependent and strongly influenced by evidence-source constraints. Guideline-anchored retrieval improves safety but is insufficient in complex solid tumor settings. Fluent presentation frequently masks safety-critical failures. These findings support the need for disease-specific, multi-axis safety evaluation frameworks and deployment guardrails prior to clinical use of LLM-based oncology decision support.
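The scoring scheme described in the Methods can be illustrated with a minimal Python sketch. The abstract does not state how guideline concordance and the hallucination penalty are combined into the mGPS; the additive form below is an assumption chosen only because it reproduces the stated −1 to +1 range. The names `Severity`, `mgps`, and `overall_disparity` are hypothetical, and no cutoffs for assigning a severity level to an individual axis are given in the abstract, so per-axis levels are taken as inputs.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Ordered disparity levels named in the abstract (ordering is the only assumption)."""
    LOW = 0
    INTERMEDIATE = 1
    HIGH = 2
    SEVERE = 3


def mgps(concordance: float, hallucination_penalty: float) -> float:
    """Combine the two components into a score in [-1, +1].

    ASSUMPTION: the components are simply added. The abstract only says the
    mGPS "incorporates" both; the actual formula may differ.
    """
    if not 0.0 <= concordance <= 1.0:
        raise ValueError("guideline concordance must lie in [0.00, 1.00]")
    if not -1.0 <= hallucination_penalty <= 0.0:
        raise ValueError("hallucination penalty must lie in [-1.00, 0.00]")
    return concordance + hallucination_penalty


def overall_disparity(axis_severities: dict[str, Severity]) -> Severity:
    """Conservative rule from the Methods: take the maximum severity across axes."""
    if not axis_severities:
        raise ValueError("at least one axis severity is required")
    return max(axis_severities.values())


# Hypothetical example values, not study data.
score = mgps(concordance=0.75, hallucination_penalty=-0.25)
worst = overall_disparity(
    {"guideline": Severity.LOW, "hallucination": Severity.HIGH}
)
```

Under the maximum-severity rule, a single high-severity axis determines the overall classification even when the other axis is rated low, which is what makes the assignment conservative.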