Towards Robust Multimodal Detection via Progressive Cross‐Domain Feature Fusion
Yaqiong Wang, Yipei Wang, Baoguo Shen, Zhaohui Dang, Zhenhua Hou

ABSTRACT
Multimodal object detection with RGB and thermal infrared imagery is essential for reliable perception in complex environments, but existing methods are often limited by insufficient cross‐modal feature fusion. To address this issue, we propose PFDet, a progressive cross‐domain feature fusion framework for robust multimodal detection. PFDet proceeds in three stages: it aligns semantic cues, enables cross‐modal feature interaction, and refines multi‐scale representations. Specifically, a Semantic Alignment Guidance (SAG) module establishes a unified semantic reference at the input level to guide subsequent fusion. Unified Cross‐Modal Fusion Modules (UCMF) are then deployed across the backbone to enable fine‐grained bidirectional feature interaction, while a Context Guide Fusion Module (CGFM) performs context‐aware multi‐scale refinement in the neck network. Experimental results on the M3FD and FLIR‐Aligned datasets show that PFDet achieves state‐of‐the‐art performance, reaching 86.7% mAP@0.5 on M3FD and 86.9% mAP@0.5 on FLIR‐Aligned, while maintaining real‐time inference speed. Ablation studies further validate the effectiveness of the proposed progressive fusion strategy. This work provides a robust and scalable paradigm for multimodal object detection that could be extended to other heterogeneous sensing modalities, contributing to reliable perception in complex real‐world environments.
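For readers unfamiliar with the metric, the mAP@0.5 figures reported above count a detection as correct only when its intersection over union (IoU) with a ground‐truth box is at least 0.5. The sketch below illustrates that criterion in plain Python. It is not the evaluation code used for PFDet: the function names, the `(x1, y1, x2, y2)` box layout, and the greedy one‐to‐one matching rule are assumptions made for illustration, and a full mAP computation would additionally integrate precision over recall and average over classes.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_true_positives(detections, ground_truths, iou_threshold=0.5):
    """Count detections matched to a ground-truth box at the given IoU.

    detections: list of (score, box) pairs for one class in one image.
    ground_truths: list of boxes for the same class and image.
    Assumption: greedy matching in descending score order, with each
    ground-truth box matched at most once.
    """
    matched = set()
    true_positives = 0
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    return true_positives
```

Under this criterion, two detections that overlap the same ground‐truth box yield only one true positive; the lower‐scoring one is treated as a false positive.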