Comparison of Large Language Model‐Based Systems and Prompt Engineering for Internal Medicine Clinical Pharmacy Cases

DOI: 10.1002/jac5.70235 ISSN: 2574-9870

Samuel S. Yang, Clement E. Ng, Hyunuk Seung, Sean Kelly

ABSTRACT

Background

Large language models (LLMs) are an application of artificial intelligence and generate responses to user inquiries that vary in accuracy and completeness. Methods of improving LLM response quality are poorly evaluated in the context of clinical pharmacy practice.

Methods

This was a single‐center, observational, prospective study conducted through internal medicine admitting services. A clinical pharmacy specialist developed 50 case questions reflective of usual practice, which were processed through LLM‐based systems in a two‐by‐two factorial design. The first variable was the selection of a general‐purpose LLM, ChatGPT 4o (GPT, OpenAI Inc., San Francisco, California, USA), or a health care provider domain‐specific LLM with retrieval‐augmented generation (RAG) features, OpenEvidence (OpenEvidence, Miami, Florida, USA). The second variable was the inclusion of a prompt engineering template with specific instructions and parameters to refine the system output. Responses were evaluated by two pharmacists and reconciled with a third pharmacist. The primary endpoint was a composite of response accuracy and completeness. Secondary outcomes included accuracy, completeness, reference validity, reproducibility, and extraneous information.

Results

Logistic regression modeling demonstrated no statistically significant interactions between the two LLM‐based systems and use of a prompt engineering template for accuracy and completeness. Predicted probabilities for meeting the primary outcome were as follows: GPT without template, 0.54; GPT with template, 0.60; OpenEvidence without template, 0.64; and OpenEvidence with template, 0.52. OpenEvidence reference validity was higher than GPT regardless of prompt engineering template use (p < 0.001 for all comparisons).
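To illustrate how the four predicted probabilities relate to a two‐by‐two factorial logistic model, the following is a minimal sketch. It assumes a saturated model (intercept, two main effects, and one interaction term) with no other covariates; the abstract does not state the exact model specification, so the coefficient names and the coding of the two factors are hypothetical. Under that assumption, the coefficients on the log‐odds scale can be recovered directly from the reported probabilities. The sketch yields point estimates only and says nothing about their uncertainty or statistical significance.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Inverse of logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Predicted probabilities reported in the abstract.
# Keys are (openevidence, template): 0 = GPT / no template, 1 = OpenEvidence / template.
# This 0/1 coding is an assumption made for illustration.
reported = {
    (0, 0): 0.54,  # GPT, no template
    (0, 1): 0.60,  # GPT, with template
    (1, 0): 0.64,  # OpenEvidence, no template
    (1, 1): 0.52,  # OpenEvidence, with template
}

# Saturated-model coefficients on the log-odds scale (hypothetical names).
b0 = logit(reported[(0, 0)])                    # reference cell: GPT, no template
b_template = logit(reported[(0, 1)]) - b0       # template effect within GPT
b_system = logit(reported[(1, 0)]) - b0         # OpenEvidence effect without template
b_interaction = logit(reported[(1, 1)]) - b0 - b_system - b_template


def predict(openevidence, template):
    """Predicted probability of meeting the primary outcome for one cell."""
    eta = (
        b0
        + b_system * openevidence
        + b_template * template
        + b_interaction * openevidence * template
    )
    return inv_logit(eta)


if __name__ == "__main__":
    for cell in sorted(reported):
        print(cell, round(predict(*cell), 2))
```

Because a saturated model has as many parameters as cells, `predict` reproduces the four reported probabilities exactly; the interaction coefficient measures how far the OpenEvidence with template cell departs from what the two main effects alone would give.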

Conclusion

Neither the use of a health care provider domain‐specific LLM with RAG nor the use of a prompt engineering template was found to improve LLM‐based system accuracy and completeness. OpenEvidence's ability to cite relevant and correct references more frequently than GPT shows promise for practice applications. There is a need for further evaluation of methods to improve LLM utility as artificial intelligence becomes further integrated in clinical pharmacy practice.

Trial Registration: University of Maryland, Baltimore IRB; HP‐00112497.
