DOI: 10.1111/exsy.70342 ISSN: 0266-4720

Multi‐Level Feature Fusion and Interaction for Cross‐Modal Fashion Retrieval

Runbing Wu, Jia Ren, Zijian Hu, Yani Cui, Bo Li, Guanglong Huang

ABSTRACT

Cross‐modal retrieval is a fundamental task in the fashion domain, where the key challenge lies in aligning semantic information across modalities and learning discriminative modality representations. The task remains difficult because of the semantic gap between modalities, insufficient use of fine‐grained features, and the limitations of existing training strategies. To address these issues, this paper proposes a Multi‐Level Feature Fusion Interaction Network (MLFFI‐Net). The framework uses a pyramid architecture for multi‐scale feature modelling and incorporates a Multi‐modal Transformer Encoder Block (MTEB) to align global and local semantics jointly. A Dynamic Gated Fusion Mechanism (DGFM) aggregates hidden states across Transformer layers to generate richer text representations. In addition, an Alternating Progressive Training Strategy (APTS) exploits the different input data streams during joint training, coordinating multiple learning tasks and improving overall performance. Experimental results on the FashionGen dataset demonstrate that MLFFI‐Net significantly outperforms existing approaches in cross‐modal retrieval for the fashion domain, validating the effectiveness of the proposed method in semantic alignment and feature representation.
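
The abstract does not give the form of the gate used by DGFM. As a rough illustration of what aggregating hidden states across Transformer layers with a gate can look like, the sketch below assumes one scalar gate per layer, computed from a hypothetical projection vector and normalized with a softmax over layers. The function name, the parameters `gate_weights` and `gate_bias`, and the scalar-gate form are all assumptions made for illustration, not the paper's formulation.

```python
import math


def gated_layer_fusion(layer_states, gate_weights, gate_bias=0.0):
    """Fuse per-layer hidden vectors using softmax-normalized scalar gates.

    layer_states: list of L vectors (each a list of d floats), one per layer.
    gate_weights: list of d floats; a hypothetical learned gate projection.
    gate_bias:    float; a hypothetical learned bias shared across layers.
    Returns (fused_vector, gates).
    """
    # One scalar score per layer: w . h_l + b (assumed gate form).
    scores = [
        sum(w * h for w, h in zip(gate_weights, state)) + gate_bias
        for state in layer_states
    ]

    # Softmax over the layer axis; subtracting the max keeps exp() stable.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    gates = [e / total for e in exps]

    # Gate-weighted sum of the layer vectors, computed per dimension.
    dim = len(layer_states[0])
    fused = [
        sum(g * state[i] for g, state in zip(gates, layer_states))
        for i in range(dim)
    ]
    return fused, gates
```

With a zero projection vector every layer receives the same gate, so the fusion reduces to a plain average of the layer states. A trained projection would instead shift weight toward the layers whose states score highest.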
