From Challenges to Solutions: A Systematic Review of Multimodal LLMs in Healthcare and Architectural Recommendations with RAG
Dipanita Saha, Yuan Lin, Debasish Ghose
As intelligent systems evolve, there is a growing need to incorporate multimodal inputs, such as medical images, clinical text, structured records, and physiological signals, to support precise and contextually informed clinical decision-making. Despite advances in large language model (LLM) capabilities, handling multimodal healthcare data still raises challenges such as computational complexity, data heterogeneity, and privacy concerns. Following the PRISMA methodology, this systematic review examines the current state of research on multimodal large language models (MLLMs) in healthcare published between 2019 and 2024. A total of 164 studies were reviewed, focusing on the types of multimodal data used, the fusion techniques and LLM architectures applied, and the key challenges and emerging opportunities in multimodal processing. The analysis reveals that many existing approaches still lack a comprehensive understanding of MLLM architectures and their role in clinical decision-making, and often lack robust privacy-preserving mechanisms and sufficient validation in real-world clinical workflows. Based on these observations, we recommend a customizable framework of interoperable components that can be configured to meet the needs of specific clinical use cases. Rather than proposing a new algorithm or rigid pipeline, this recommendation incorporates a retrieval-augmented generation (RAG) framework coupled with an LLM that can integrate multimodal healthcare data. A central focus of this design is the integration of privacy-preserving retrieval strategies with RAG workflows to ensure secure handling of multimodal data. This in turn requires examining the role of multimodal data encoding and storage, encrypted retrieval, reranking of retrieved data, multimodal prompt engineering, and secure content generation.
Together, these components enable the integration of RAG with MLLMs in a way that supports privacy-aware, explainable, and clinically grounded decision support. The insights presented aim to guide future efforts to develop secure, explainable, and clinically integrated intelligent systems that take advantage of RAG and MLLMs.
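
The component sequence named above (encoding and storage, retrieval, reranking, prompt construction) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration and is not the framework recommended in the review: the `Record` type, the bag-of-words `embed` encoder, the overlap-based `rerank`, and the `MRN-<digits>` identifier format are all hypothetical. In particular, encrypted retrieval is not implemented here; a simple identifier redaction step stands in for the privacy-preserving component, and no LLM is called.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    modality: str  # e.g. "text", "image_report", "signal_summary" (hypothetical labels)
    content: str   # textual surrogate for the stored item


def redact(text: str) -> str:
    """Toy privacy step: mask tokens in a hypothetical 'MRN-<digits>' format.
    This is a stand-in for, not an implementation of, encrypted retrieval."""
    return re.sub(r"MRN-\d+", "[REDACTED]", text)


def embed(text: str) -> Counter:
    """Toy encoder: lowercase bag-of-words counts instead of a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, store: List[Record], k: int = 2) -> List[Record]:
    """Return the k records most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda r: cosine(q, embed(r.content)), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: List[Record]) -> List[Record]:
    """Toy reranker: order candidates by the number of distinct shared query terms."""
    q_terms = set(embed(query))
    return sorted(
        candidates,
        key=lambda r: len(q_terms & set(embed(r.content))),
        reverse=True,
    )


def build_prompt(query: str, records: List[Record]) -> str:
    """Assemble a modality-tagged prompt from redacted context."""
    context = "\n".join(f"[{r.modality}] {redact(r.content)}" for r in records)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {redact(query)}\n"
        "Answer using only the context."
    )


if __name__ == "__main__":
    store = [
        Record("image_report", "Chest x-ray for MRN-12345 shows right lower lobe opacity"),
        Record("text", "Patient reports cough and fever for three days"),
        Record("signal_summary", "ECG shows normal sinus rhythm"),
    ]
    question = "chest opacity with cough"
    top = rerank(question, retrieve(question, store, k=2))
    print(build_prompt(question, top))
```

In a real deployment each toy function would be replaced by a configurable component (a multimodal encoder, a secured vector store, a learned reranker, and an LLM call), which is the kind of interoperability the recommendation emphasizes.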