DOI: 10.3390/s26133998 ISSN: 1424-8220

Medical Vision-Language Models: Existing Technologies, Clinical Applications and Future Directions

Le Zou, Mengyu Ma, Jun Li, Hao Chen, Shuang Peng

Medical image analysis is a cornerstone of modern healthcare, yet conventional single-modal deep learning often struggles with the physical constraints and structural variability inherent in data acquired from diverse medical sensors. Recently, Vision-Language Models (VLMs) have sparked a paradigm shift by bridging the semantic gap between visual sensor signals and clinical narratives. Following the PRISMA guidelines, this review systematically synthesizes 167 representative studies to provide a comprehensive roadmap of VLM technological evolution and clinical utility. First, rather than treating VLMs as generic feature extractors, the review distills their underlying mechanisms into seven core operational principles, which are then explicitly mapped to downstream applications such as few-shot diagnosis, prompt-driven segmentation, and multi-task foundation models. To facilitate evaluation, a quantitative cross-comparison of current benchmark architectures is presented. The review also goes beyond highlighting successes by critically assessing prevalent clinical bottlenecks, including zero-shot segmentation failures, multi-modal hallucinations in diagnosing rare diseases, and the prohibitive computational complexity associated with 3D volumes and gigapixel whole slide images. Finally, a forward-looking framework is proposed: the transition from static “image-text alignment” to dynamic “multi-source sensor-driven intelligence”. By addressing both physical sensor constraints and algorithmic limitations, this survey offers actionable insights for developing trustworthy, sensor-aware clinical diagnostic agents.
