Disentangled Multimodal Tuning and Interaction for Human Perception Understanding
Hao Sun, Xinyao Yu, Ziwei Niu, Jiaqing Liu, Yen-Wei Chen, Lanfen Lin

Understanding human perceptions poses a significant multimodal challenge for computers, involving textual, acoustic, and visual signals. Recently, large language models have garnered great attention, leading to numerous methods for efficiently fine-tuning pretrained models on multimodal downstream tasks. However, few techniques account for modality-invariant and modality-specific information during parameter-efficient tuning, even though previous studies have shown modality disentangling to be effective. To address this gap, we propose a novel multimodal tuning approach for large language models, termed Disentangled Multimodal Tuning and Interaction. Specifically, we evaluate the independence among different modalities and disentangle the corresponding modality-invariant and modality-specific components, which are subsequently leveraged for prompt tuning. Following tuning, a newly designed independence-guided cross-attention module is introduced for modality interaction, in which the attention mechanism is decoupled and strengthened with the independence estimates obtained from the modality-disentangling process. This approach not only enables large language models to efficiently assimilate information from various modalities but also makes them aware of both modality-invariant and modality-specific information. Compared to previous methods, our approach facilitates modality interaction at a more granular level, resulting in enhanced performance. We validate our method through experiments on four public datasets, demonstrating significant performance improvements.
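
To make the idea of independence-guided interaction more concrete, the following is a minimal illustrative sketch and not the method proposed here. The abstract does not specify how independence is measured or how it enters the attention computation, so every choice below is an assumption: dependence between two modalities is scored with a normalized Hilbert-Schmidt Independence Criterion (kernel CKA), and that score is used as a scalar gate on an ordinary scaled dot-product cross-attention. All function and variable names (`hsic`, `normalized_dependence`, `gated_cross_attention`, the toy feature arrays) are hypothetical.

```python
import numpy as np


def rbf_kernel(x, sigma=1.0):
    """RBF Gram matrix for samples x of shape (n, d)."""
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = np.maximum(sq + sq.T - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def hsic(x, y, sigma=1.0):
    """Biased HSIC estimate; larger values mean stronger dependence."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    k, l = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
    return float(np.trace(k @ h @ l @ h)) / (n - 1) ** 2


def normalized_dependence(x, y, sigma=1.0):
    """HSIC normalized to [0, 1] (kernel CKA). ASSUMPTION: stand-in score."""
    return hsic(x, y, sigma) / np.sqrt(hsic(x, x, sigma) * hsic(y, y, sigma))


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(query, context, gate):
    """Scaled dot-product cross-attention whose output is scaled by `gate`.

    ASSUMPTION: a scalar gate on a residual connection is only one simple
    way a dependence score could guide modality interaction.
    """
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d))  # (Tq, Tc)
    attended = weights @ context                       # (Tq, d)
    return query + gate * attended, weights


# Toy data: audio is a noisy copy of text, video is drawn independently.
rng = np.random.default_rng(0)
text = rng.normal(size=(200, 2))
audio = text + 0.1 * rng.normal(size=(200, 2))
video = rng.normal(size=(200, 2))

dep_text_audio = normalized_dependence(text, audio)
dep_text_video = normalized_dependence(text, video)

# Hypothetical token sequences for one sample (5 text and 7 audio tokens).
query_tokens = rng.normal(size=(5, 8))
context_tokens = rng.normal(size=(7, 8))
fused, attn = gated_cross_attention(query_tokens, context_tokens, dep_text_audio)
```

In this toy setup the text/audio pair receives a higher dependence score than the text/video pair, so the gate passes more cross-modal information for the former. A gate of zero reduces the module to the identity on the query tokens.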