DOI: 10.1002/jeo2.70813 ISSN: 2197-1153

Promising performance of locally deployed large language models for postoperative orthopaedic patient questions: An In Silico analysis

Lea Lanter, Sophie Masel, Bettina Hochreiter, Michel Meisterhans, Michael Rebsamen, Benedikt Herzog, Sebastiano Caprara, Felix C. Oettl

Abstract

Purpose

Generative artificial intelligence (AI), particularly large language models (LLMs), is increasingly utilised in healthcare. While this may reduce the initial workload of healthcare professionals, unvalidated model outputs can pose a relevant risk to patient safety if they are inaccurate, incomplete or inconsistent with recommendations. This study evaluates and compares the performance of local and commercial LLMs on patient questions to inform institutional strategies for patient communication and management.

Methods

Twenty postoperative patient questions were constructed and posed to two commercial models (GPT‐5 and Claude 4.5 Sonnet) and two locally hosted models (GPT‐OSS and Apertus). Responses were assessed using the QUEST framework, which evaluates quality, understanding, expression, safety and trust. Four blinded reviewers (two board‐certified, fellowship‐trained orthopaedic surgeons and two orthopaedic surgery residents) assessed the responses. Overall model performance was compared using the Friedman test, with Wilcoxon signed‐rank tests for pairwise post hoc comparisons. Metrics where lower values indicate better performance were inverted so that higher values uniformly represent better performance across all metrics.
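The metric inversion and the Friedman comparison described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes scores are normalised to the range 0 to 1 and that inversion is a simple linear flip, neither of which the abstract states. The function names and example scores are hypothetical, and the statistic omits the tie correction that statistical packages usually apply.

```python
from typing import List, Sequence


def invert(score: float, low: float = 0.0, high: float = 1.0) -> float:
    """Flip a lower-is-better score so higher is better (assumed linear flip)."""
    return high + low - score


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(blocks: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square statistic (no tie correction).

    Each row is one block (e.g. one question) and each column one treatment
    (e.g. one model). Values are ranked within each row.
    """
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical scores (not study data): rows are questions, columns are models.
example_scores = [
    [0.95, 0.93, 0.88, 0.70],
    [0.97, 0.90, 0.85, 0.60],
    [0.92, 0.94, 0.80, 0.75],
]
example_statistic = friedman_statistic(example_scores)
```

Under the null hypothesis the statistic is approximately chi-square distributed with k − 1 degrees of freedom. A significant result would then be followed by pairwise Wilcoxon signed‐rank tests, which are not shown here.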

Results

The highest overall performance scores were observed for Claude 4.5 Sonnet (mean 0.949, SD 0.124) and GPT‐5 (0.937, 0.142), followed by GPT‐OSS (0.873, 0.189), while Apertus performed worst (0.693, 0.317). On QUEST dimensions, GPT‐5 and Claude 4.5 Sonnet achieved consistently high ratings, whereas Apertus scored lower across information quality, reasoning and expression. Safety‐relevant issues were concentrated in Apertus, with harmful content and fabrication each occurring in 22.5% of evaluated ratings (18/80). At the output level, this corresponded to harmful content in 9/20 unique model responses and fabrication in 11/20 unique model responses. Inter‐rater agreement was moderate (overall Fleiss' κ ≈ 0.50).
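The difference between rating‐level and output‐level counts, and the Fleiss' κ agreement statistic, can be illustrated with a short sketch. It rests on assumptions: the abstract does not say how ratings were aggregated to the output level, so the rule used here (a response counts as flagged if at least one reviewer flagged it) is a guess, and all function names and the flag layout are hypothetical rather than the study's data.

```python
from typing import Sequence


def rating_level_rate(flags: Sequence[Sequence[bool]]) -> float:
    """Share of individual ratings that are flagged (responses x reviewers)."""
    total = sum(len(row) for row in flags)
    return sum(sum(row) for row in flags) / total


def output_level_count(flags: Sequence[Sequence[bool]]) -> int:
    """Number of responses flagged by at least one reviewer (assumed rule)."""
    return sum(1 for row in flags if any(row))


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a fixed number of raters per item.

    counts[i][c] is the number of raters who assigned item i to category c.
    Undefined (division by zero) when every rating falls in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    total = n_items * n_raters
    # Proportion of all assignments that fall into each category.
    p_cat = [sum(row[c] for row in counts) / total for c in range(n_categories)]
    # Observed agreement among rater pairs, computed per item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical layout: 20 responses x 4 reviewers, with 18 flags spread over
# 9 responses (two reviewers each). This reproduces 18/80 and 9/20 only as an
# arithmetic illustration; the real distribution of flags is not reported.
example_flags = [[True, True, False, False]] * 9 + [[False] * 4] * 11
example_rate = rating_level_rate(example_flags)
example_outputs = output_level_count(example_flags)
```

The sketch shows why the same 18 flagged ratings can map to different output‐level counts: the more the flags are spread across responses, the more unique responses are affected.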

Conclusions

In this structured evaluation setting, commercial LLMs showed higher overall performance and fewer safety‐relevant ratings, but they remain externally controlled. Achieving privacy‐compliant, real‐time clinical integration will require advancing local LLMs through fine‐tuning, rigorous validation and robust safety guardrails.

Level of Evidence

Level V.
