Hepatology e-consult responses generated by artificial intelligence demonstrate accuracy but require human oversight

DOI: 10.1097/hc9.0000000000000983 ISSN: 2471-254X

Holly K.T. Huang, Debra W. Yen, Michelle Y. Li, Gabrielle Jutras, Lisa X. Deng, Charles E. McCulloch, Mark J. Pletcher, Jennifer C. Lai, Bilal Hameed, Jin Ge

Background:

Electronic consultations (e-consults) improve specialist access but burden providers. We developed LiVersa, a customized large language model (LLM) for liver diseases. We evaluated its performance in drafting hepatology e-consult responses and the equivalence between human and machine reviewers.

Methods:

LiVersa generated draft responses for hepatology e-consults answered at the University of California San Francisco (UCSF) from January to March 2025. Using a 12-item rubric, 3 independent hepatologists and an “LLM-as-a-judge” (OpenAI-o1) evaluated the drafts against the original hepatologist responses. We tested equivalence between the human reviewers and the “LLM-as-a-judge” using two one-sided tests (TOST).
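As an illustration of the TOST procedure named above, the following is a minimal sketch of an equivalence test on paired rating differences (human score minus LLM-judge score). It is not the authors' analysis code: the function name `tost_paired`, the equivalence margin, the use of a large-sample normal approximation in place of a t distribution, and the example numbers are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired(diffs, margin, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence of a mean paired difference.

    diffs  : per-item differences between two raters (hypothetical data)
    margin : equivalence margin; the margin used in the study is not stated
             in the abstract, so any value passed here is an assumption
    Uses a normal (z) approximation, which is itself an assumption.
    """
    n = len(diffs)
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(n)  # standard error of the mean difference
    z = NormalDist()

    # Test 1: H0 says the true mean difference is <= -margin.
    p_lower = 1.0 - z.cdf((d_bar + margin) / se)
    # Test 2: H0 says the true mean difference is >= +margin.
    p_upper = z.cdf((d_bar - margin) / se)

    # Equivalence is claimed only if BOTH one-sided tests reject,
    # so the overall p-value is the larger of the two.
    p_value = max(p_lower, p_upper)
    return d_bar, p_value, p_value < alpha


# Illustrative numbers only, not data from the study.
close = [0.1, -0.1, 0.0, 0.05, -0.05, 0.1, -0.1, 0.0]
apart = [1.0, 1.1, 0.9, 1.05, 0.95, 1.1, 0.9, 1.0]

_, p_close, equivalent_close = tost_paired(close, margin=0.5)
_, p_apart, equivalent_apart = tost_paired(apart, margin=0.5)
```

Unlike a conventional difference test, a small TOST p-value supports equivalence: it indicates that the mean difference lies within the chosen margin on both sides.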

Results:

Among 61 e-consults, the most common categories were abnormal liver function tests (34%), hepatitis B (23%), and abnormal imaging (21%). LiVersa drafts did not differ significantly from hepatologist responses in word count (284 vs. 264, p = 0.47) or verbosity (24 vs. 25 words per sentence, p = 0.44). Human reviewers rated 72% of drafts as reasonable starting points and 83% as providing appropriate case-specific recommendations; 10% contained misleading/incorrect information, and 3.4% posed a risk of severe harm. LiVersa performed better at avoiding misleading information and extraneous suggestions but scored lower on clinical equivalence, immediate usability, and comprehensiveness. LLM-based reviewers were more stringent than human reviewers, rating fewer drafts as clinically equivalent (27% vs. 48%) and more as potentially harmful (67% vs. 20%), but their ratings were statistically equivalent to human ratings on accuracy, precision, and comprehensiveness (mean difference 0.026–0.029; TOST p < 0.05).

Conclusions:

Customized LLMs like LiVersa show promise for e-consult drafting but require human oversight. LLM-as-a-judge was more conservative than humans, supporting its role in rapid quality assurance during model updates.
