DOI: 10.1200/cci-25-00361 ISSN: 2473-4276

Can Small Open-Source Language Models With Retrieval-Augmented Generation Match GPT-4 Performance in Breast Cancer Clinical Decision Support?

Chanhee Park, In Hae Park, Minhyuk Kim, Youngjoon Jang, Heuiseok Lim, Yuna Hur

PURPOSE

The rapidly evolving breast cancer treatment landscape creates significant information synthesis challenges for clinicians. We evaluated whether small open-source large language models (LLMs) augmented with retrieval-augmented generation (RAG) could match proprietary model performance for clinical guideline queries.

METHODS

We developed a domain-specialized RAG pipeline using HTML-structure-preserving chunking of 1,356 ASCO breast cancer guideline documents. Five LLMs were each evaluated with and without RAG: GPT-4-turbo, GPT-3.5-turbo, Qwen2.5-14B (14 billion parameters), LLaMA3-8B, and OpenBioLLM-8B. Performance was assessed using 98 expert-curated question-answer-context triplets across seven breast cancer categories. Evaluation used both rubric-based scoring (six metrics: fluency, relevance, reliability, consistency, clarity, and clinical impact) and exhaustive pairwise ranking by GPT-4-turbo as judge. Human validation was conducted with 15 practicing oncologists on a 10-query subset.
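
As an illustration of what HTML-structure-preserving chunking can look like, the sketch below splits a page into text blocks and attaches to each block the trail of headings it sits under, so a retrieved passage keeps its place in the guideline. The abstract does not describe the actual chunker, so this is a minimal sketch under assumptions: the class name `SectionChunker`, the function `chunk_html`, the choice of tags, and the sample markup are all hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser

# Assumption: headings define the section hierarchy, and these tags hold the text.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = {"p", "li", "td", "th"}


class SectionChunker(HTMLParser):
    """Collect text blocks and label each with the heading path above it."""

    def __init__(self):
        super().__init__()
        self.path = []    # (level, heading text) pairs currently in scope
        self.chunks = []  # dicts of the form {"path": [...], "text": "..."}
        self._tag = None  # tag whose text is currently being buffered
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS or tag in BLOCK_TAGS:
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tags such as <b> do not close the current block
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        self._tag = None
        if not text:
            return
        if tag in HEADING_TAGS:
            level = int(tag[1])
            # A new heading replaces any heading at the same or a deeper level.
            self.path = [(lvl, t) for lvl, t in self.path if lvl < level]
            self.path.append((level, text))
        else:
            self.chunks.append({"path": [t for _, t in self.path], "text": text})


def chunk_html(html):
    """Return the structure-aware chunks found in an HTML string."""
    parser = SectionChunker()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up markup for illustration only; it is not taken from any guideline.
sample = (
    "<h1>Guideline</h1><h2>Recommendations</h2>"
    "<p>Offer  <b>therapy</b> X.</p>"
    "<h2>Evidence</h2><ul><li>Trial A</li></ul>"
)
chunks = chunk_html(sample)
```

Keeping the heading path alongside each chunk is one way to let the retriever or the prompt show which section a passage came from; a production pipeline would likely also handle tables and size limits, which this sketch omits.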

RESULTS

RAG-enhanced Qwen2.5-14B achieved mean rubric scores of 3.77 versus 3.96 for GPT-4-turbo and pairwise ranking performance of 0.72 versus 0.81 (normalized scale). Although absolute rubric gains were modest (0.02-0.05 on a five-point scale), relative improvements in head-to-head win rates ranged from 16% to 46%. Human expert scores confirmed RAG superiority but were consistently more conservative than LLM judge scores (mean 3.81 v 4.12 across all metrics). Optimal retrieval used top-5 contexts; performance degraded sharply at higher context volumes.
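
To make the pairwise figures easier to interpret, the sketch below shows one way exhaustive pairwise judgments could be reduced to a normalized score and a relative improvement. The abstract does not define the normalization, so this assumes the score is the fraction of head-to-head comparisons a system wins and that relative improvement is the change in win rate divided by the baseline win rate. The function names, the `judge` callback, and the toy systems are hypothetical and stand in for the LLM judge.

```python
from itertools import combinations


def pairwise_win_rates(systems, queries, judge):
    """Compare every pair of systems on every query and return win fractions.

    `judge(query, a, b)` is a hypothetical callback that returns the preferred
    system, either `a` or `b`.
    """
    wins = {s: 0 for s in systems}
    games = {s: 0 for s in systems}
    for query in queries:
        for a, b in combinations(systems, 2):  # exhaustive: all unordered pairs
            winner = judge(query, a, b)
            wins[winner] += 1
            games[a] += 1
            games[b] += 1
    return {s: wins[s] / games[s] for s in systems}


def relative_improvement(with_rag, without_rag):
    """Relative change in win rate when RAG is added (assumed definition)."""
    return (with_rag - without_rag) / without_rag


# Toy judge for illustration: always prefers the system with the higher
# made-up strength value. These are not results from the study.
strength = {"A": 3, "B": 2, "C": 1}


def toy_judge(query, a, b):
    return a if strength[a] > strength[b] else b


rates = pairwise_win_rates(["A", "B", "C"], ["q1", "q2"], toy_judge)
gain = relative_improvement(0.6, 0.5)  # toy win rates, not study data
```

Under this reading, a small shift in absolute rubric score can coexist with a large relative change in win rate, because the win rate records only which answer was preferred, not by how much.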

CONCLUSION

Small open-source LLMs with optimized RAG can approach state-of-the-art proprietary model performance for clinical decision support. This approach enables scalable, cost-effective, privacy-preserving deployment without recurrent fine-tuning, suggesting potential for real-world clinical implementation on single graphics processing unit (GPU) infrastructure under expert supervision.
