Disentangled Multimodal Tuning and Interaction for Human Perception Understanding

doi:10.1145/3828163

ISSN: 1551-6857

Hao Sun, Xinyao Yu, Ziwei Niu, Jiaqing Liu, Yen-Wei Chen, Lanfen Lin

Understanding human perception is a significant multimodal challenge for computers, involving textual, acoustic, and visual signals. Recently, large language models have garnered great attention, leading to numerous methods for efficiently fine-tuning pretrained models on multimodal downstream tasks. However, few techniques account for modality-invariant and modality-specific information during parameter-efficient tuning, despite evidence from previous studies that modality disentangling is effective. To address this gap, we propose a novel multimodal tuning approach for large language models, termed Disentangled Multimodal Tuning and Interaction. Specifically, we evaluate the independence among different modalities and disentangle the corresponding modality-invariant and modality-specific components, which are subsequently leveraged for prompt tuning. Following tuning, a newly designed independence-guided cross-attention module is introduced for modality interaction, where the attention mechanism is decoupled and bolstered with the independence information from the modality-disentangling process. This approach not only enables large language models to efficiently assimilate information from various modalities but also cultivates an awareness of both modality-invariant and modality-specific information. Compared to previous methods, our approach facilitates modality interaction at a more granular level, resulting in enhanced performance. We validate our method through experiments on four public datasets, demonstrating significant performance improvements.
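
The abstract does not specify how the modality-invariant and modality-specific components are computed or how independence is measured. As a rough illustration of the general idea only, the sketch below assumes a very simple scheme: the invariant part is the mean of the per-modality feature vectors, each specific part is the residual, and "independence" is approximated by how close two vectors are to orthogonal. The function names `disentangle` and `independence` are hypothetical, and none of this should be read as the authors' actual formulation.

```python
# Illustrative sketch only, NOT the paper's method. The mean/residual split and
# the cosine-based independence proxy are assumptions chosen for simplicity.
import math


def disentangle(features):
    """Split per-modality vectors into a shared part and per-modality residuals.

    features: dict mapping modality name -> list of floats (all the same length).
    Returns (invariant, specific), where specific maps each modality to its residual.
    """
    names = list(features)
    dim = len(features[names[0]])
    # Assumed invariant component: element-wise mean across modalities.
    invariant = [sum(features[m][i] for m in names) / len(names) for i in range(dim)]
    # Assumed specific component: what remains after removing the shared part.
    specific = {m: [features[m][i] - invariant[i] for i in range(dim)] for m in names}
    return invariant, specific


def independence(u, v):
    """Crude proxy in [0, 1]: 1 when u and v are orthogonal, 0 when collinear."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 1.0  # a zero vector carries no shared direction
    return max(0.0, 1.0 - abs(dot) / norm)


# Toy features for three modalities (made-up numbers for illustration).
features = {
    "text": [1.0, 0.0, 2.0],
    "audio": [0.0, 1.0, 2.0],
    "vision": [2.0, 2.0, 2.0],
}
invariant, specific = disentangle(features)
score = independence(specific["text"], specific["audio"])
```

In a scheme like this, a score such as `score` could then be used to scale cross-attention weights between two modalities, which is one possible reading of "independence-guided"; the paper itself should be consulted for the actual mechanism.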
