DOI: 10.1145/3757322 ISSN: 1551-6857

Language-guided Visual Tracking: Comprehensive and Effective Multimodal Information Fusion

Jianbo Song, Hong Zhang, Yachun Feng, Hanyang Liu, Yifan Yang

Current vision-language trackers often struggle to fuse multimodal information comprehensively and effectively, leading to suboptimal performance on multimodal tasks. This study introduces LGTrack, a novel language-guided visual tracking framework designed to achieve a more comprehensive and effective fusion of vision and language information. In the encoding stage, an Enhanced Multimodal Interaction Module is proposed to achieve full multimodal fusion; it is used to construct Early Language Multilevel Guided Multimodal Encoding, which leverages deep semantic information to guide vision encoding early and at multiple levels. In the decoding stage, a multimodal decoding scheme based on a Joint Query is proposed, which utilizes global features from both the vision and language modalities to guide the efficient operation of the decoding layers. Together, these innovations achieve a more comprehensive fusion of multimodal information. Additionally, a contrastive learning strategy is introduced to align vision and language features in the semantic space, further enhancing the fusion effectiveness. Extensive experiments on multiple benchmarks, including LaSOT, \(\mathrm{LaSOT_{ext}}\), TNL2K, and OTB99-Lang, demonstrate that our approach outperforms existing state-of-the-art trackers.
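To make the contrastive alignment idea concrete, the sketch below shows one common way such a strategy is realized: a symmetric InfoNCE-style loss that pulls matched vision and language global features together in a shared semantic space while pushing mismatched pairs apart. This is a minimal illustration under assumed conventions, not the paper's actual implementation; the function name `contrastive_alignment_loss`, the inputs `vision_feat` and `lang_feat`, and the `temperature` value are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(vision_feat: torch.Tensor,
                               lang_feat: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical symmetric InfoNCE-style alignment loss.

    vision_feat, lang_feat: (batch, dim) global features; row i of each
    tensor is assumed to come from the same video/description pair.
    """
    # L2-normalize so dot products are cosine similarities.
    v = F.normalize(vision_feat, dim=-1)
    t = F.normalize(lang_feat, dim=-1)

    # Pairwise similarity matrix scaled by temperature; the diagonal
    # holds the positive (matched) vision-language pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: vision-to-language and
    # language-to-vision retrieval.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

In a training loop, this loss would typically be added to the tracking objective so that the encoder learns vision features that live in the same semantic space as the language description, which is what makes early language guidance of the vision encoding meaningful.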
