DOI: 10.3390/jcm15134919 ISSN: 2077-0383

Between Accessibility and Reliability: High Confidence, Low Control in General-Purpose Multimodal Models for Hip Fracture Radiograph Interpretation

Hadar Gan-Or, Shaked Ankol, Guy Ben Arie, Itay Ashkenazi, Yaniv Warschawski

Background: Dedicated artificial intelligence (AI) systems for fracture detection already exist, yet general-purpose multimodal models are increasingly accessible to clinicians despite not being developed or formally validated as medical devices. Their behavior in focused orthopedic imaging tasks remains insufficiently characterized.

Purpose: To characterize how two accessible general-purpose multimodal models interpret AP pelvis radiographs with hip fractures, focusing on context dependence, overconfidence, and complementary error patterns within a surgically confirmed positive-only cohort. This was a behavioral characterization study of a fracture-positive cohort, not a diagnostic accuracy evaluation.

Methods: In April 2026, we retrospectively studied 214 surgically confirmed hip fractures on AP pelvis radiographs using two general-purpose multimodal models under six prompting conditions. In runs A–D, the models were explicitly told that a hip fracture was present and were asked to classify it; in runs E–F, they were not told whether a hip fracture was present. Each image was rerun de novo in a separate chat session through vendor APIs using a fixed base prompt and no image preprocessing. We recorded hip-fracture detection, correct laterality, coarse fracture pattern, intracapsular displacement, AO/OTA grading, subtrochanteric identification, and self-reported confidence. Because the cohort contained hip fractures only, we report fracture-detection rates and classification performance within a positive-only cohort rather than full diagnostic-accuracy metrics.

Results: Using the more conservative endpoint of hip-fracture detection with correct laterality, GPT-5.4 was correct in 79.0% and 86.4% of cases in runs E and F, whereas Gemini was correct in 80.4% and 93.5%, respectively. When outputs from both models were combined, this endpoint reached 89.7% in run E and 96.7% in run F, indicating complementary rather than redundant error patterns. Incorrect laterality cues markedly degraded performance, from 90.7% to 66.4% in GPT-5.4 and from 97.7% to 57.0% in Gemini. Performance remained limited for treatment-relevant subtyping, particularly AO/OTA grading and subtrochanteric identification. Both models frequently remained highly confident when wrong, and self-reported confidence did not reliably distinguish correct from incorrect outputs.

Conclusions: Accessible general-purpose multimodal models showed partial capability for coarse hip-fracture interpretation, but they remained context-sensitive, unreliable for treatment-relevant subtyping, and highly confident even when incorrect. Their complementary error patterns are hypothesis-generating rather than evidence of clinical readiness. On the basis of these findings, we do not support unvalidated or uncontrolled clinical use of such models. As access to these tools expands, explicit usage boundaries, minimum performance expectations, repeated local revalidation, and sustained human oversight become increasingly necessary.
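
The reported per-model and combined rates for runs E and F can be related by inclusion-exclusion, which gives a rough sense of how much the two models' correct outputs overlapped. The sketch below is illustrative only. It assumes that "combined" means at least one of the two models was correct on the endpoint (the abstract does not define the combination rule), and it works from the rounded percentages quoted above, so the derived shares are approximate. The function and variable names are hypothetical and not taken from the study.

```python
# Illustrative sketch: decompose reported rates (percent of 214 cases)
# into overlap categories using inclusion-exclusion.
# ASSUMPTION: "combined" = at least one model correct (union of correct sets).
# Inputs are the rounded percentages from the abstract, so outputs are approximate.

REPORTED = {
    "E": {"gpt": 79.0, "gemini": 80.4, "combined": 89.7},
    "F": {"gpt": 86.4, "gemini": 93.5, "combined": 96.7},
}


def decompose(gpt: float, gemini: float, combined: float) -> dict:
    """Split rates into both-correct, one-model-only, and neither-correct shares."""
    both = gpt + gemini - combined  # P(A and B) = P(A) + P(B) - P(A or B)
    return {
        "both_correct": round(both, 1),
        "gpt_only": round(gpt - both, 1),
        "gemini_only": round(gemini - both, 1),
        "neither": round(100.0 - combined, 1),
    }


results = {run: decompose(**rates) for run, rates in REPORTED.items()}

for run, shares in results.items():
    print(run, shares)
```

Under that assumption, roughly 69.7% of cases in run E and 83.2% in run F would have been correct in both models, with the remainder of the combined rate contributed by one model alone. This is consistent with the abstract's description of complementary rather than redundant errors, but it does not say which model to trust on any individual case.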
