DOI: 10.3390/healthcare14131877 ISSN: 2227-9032

Do Multimodal Vision-Language Models Enhance the Medical Diagnostic Process? A Systematic Review

Lattawat Eauchai, Laura Otálora González, Yifan Shi, Michele T. McGinnis, Alexander Yovchev, Svetlana Herasevich, Brian W. Pickering, Vitaly Herasevich

Background/Objectives: Novel vision-language models (VLMs) can integrate patient textual data with image data to support medical diagnosis. Recent studies have reported conflicting results regarding the performance of multimodal VLMs compared with other models and with physicians. This systematic review aims to assess the diagnostic performance of multimodal VLMs integrating both patient textual and image data across diverse real-world hospital settings. Methods: We performed comprehensive searches of eight resources, including Embase, MEDLINE, and SCOPUS, on 17 December 2025. We included studies that reported the diagnostic performance of VLMs integrating both image and patient history textual data from real-world adult patients, compared with that of other models and physicians. The review adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The Prediction model study Risk Of Bias Assessment Tool + AI (PROBAST + AI) was used to assess quality and risk of bias. The study protocol was registered in the PROSPERO database (CRD420251244054). This review received no external funding. Results: We screened 11,026 records, of which 18 studies met the inclusion criteria. Six studies comparing multimodal and unimodal models demonstrated the consistent superiority of the multimodal models. Four studies evaluating VLM accuracy as standalone agents compared with physician performance reported conflicting evidence. One study assessing VLMs as a clinical copilot found higher accuracy in the group of physicians using VLM assistance. A meta-analysis could not be performed due to heterogeneity across study populations and outcomes. The majority of the studies were assessed as having a high risk of bias due to dataset quality. Primary limitations identified across studies include small sample size, a lack of external validation, and the need for prospective clinical deployment studies. No study provided documented considerations regarding model safety or data security. Conclusions: This systematic review suggests that multimodal VLMs consistently outperform unimodal models with access to only image or text. While model performance as standalone agents compared to humans remains inconclusive, a copilot model has demonstrated high diagnostic accuracy. Given substantial methodological concerns across studies, cautious interpretation is required, and no firm clinical recommendation can be made regarding the use of standalone VLMs. Further research employing high-quality datasets is needed to ensure the reliability and clinical applicability of future VLMs.
