Mitigating Spectral Imbalance and Detail Attenuation in RGB-Thermal Object Detection via Frequency-Guided Multimodal Fusion
Quan Du, Ming Zhao, Lu Song, Minnan Hu, Zhengqiang Wang, Wangyu Wu

RGB-T object detection combines visible texture information with thermal saliency cues to improve detection under degraded illumination. Existing RGB-T fusion methods usually perform feature interaction in the spatial domain or treat spectral responses jointly, which can let coarse background components dominate the fusion process while weakening boundary and small-target details. In addition, the repeated upsampling and aggregation operations in the detection neck can further smooth the high-frequency responses preserved during early fusion. This paper proposes F2Net, a frequency-guided RGB-T object detection framework built on a dual-stream YOLOv11s architecture. The method decomposes RGB and thermal features into low- and high-frequency components for separate cross-modal fusion, mitigates detail attenuation during neck decoding, and regularizes spatial correspondence between RGB and thermal representations during training. On M3FD, F2Net achieves 89.6% mAP@0.5 and 62.1% mAP@0.5:0.95, improving on the Dual-YOLOv11s baseline by 7.7 and 6.6 percentage points, respectively, while increasing the parameter count from 13.8M to 15.4M and the computational cost from 33.9 to 35.6 GFLOPs. Additional experiments on LLVIP and KAIST evaluate the method under low-light and road-scene conditions. The KAIST results show that high-IoU localization remains challenging in dense and occluded pedestrian scenes. This indicates that frequency-guided fusion mainly strengthens target response generation and moderate-IoU detection, but does not fully solve precise boundary regression under severe occlusion and weak contour conditions.
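
To make the idea of band-wise fusion concrete, the following is a toy sketch and not the F2Net implementation. It assumes a simple box filter as the low-pass operator, takes the residual as the high-frequency part, averages the low bands, and keeps the stronger of the two high-band responses. The function names (`box_lowpass`, `split_bands`, `fuse`) and both fusion rules are illustrative assumptions. The abstract does not specify how F2Net decomposes or fuses the bands.

```python
from typing import List, Tuple

Grid = List[List[float]]  # a single-channel feature map as nested lists


def box_lowpass(x: Grid, radius: int = 1) -> Grid:
    """Low-frequency part: mean over a (2r+1)x(2r+1) window, clipped at borders."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            rows = range(max(0, i - radius), min(h, i + radius + 1))
            cols = range(max(0, j - radius), min(w, j + radius + 1))
            vals = [x[r][c] for r in rows for c in cols]
            out[i][j] = sum(vals) / len(vals)
    return out


def split_bands(x: Grid, radius: int = 1) -> Tuple[Grid, Grid]:
    """Split a map into (low, high), where high is the residual x - low."""
    low = box_lowpass(x, radius)
    high = [[x[i][j] - low[i][j] for j in range(len(x[0]))] for i in range(len(x))]
    return low, high


def fuse(rgb: Grid, thermal: Grid, radius: int = 1) -> Grid:
    """Fuse two maps band by band (assumed rules, for illustration only)."""
    rgb_low, rgb_high = split_bands(rgb, radius)
    th_low, th_high = split_bands(thermal, radius)
    h, w = len(rgb), len(rgb[0])
    fused = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Low band: plain average, so neither background dominates.
            low = 0.5 * (rgb_low[i][j] + th_low[i][j])
            # High band: keep the stronger detail response of the two modalities.
            a, b = rgb_high[i][j], th_high[i][j]
            high = a if abs(a) >= abs(b) else b
            fused[i][j] = low + high
    return fused


# Toy inputs: RGB has one sharp detail, thermal is flat.
rgb = [[1.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 1.0]]
thermal = [[2.0] * 3 for _ in range(3)]
fused = fuse(rgb, thermal)
```

In this toy case the flat thermal map contributes no high-frequency content, so the sharp RGB detail at the center survives fusion instead of being averaged away. This is the kind of behavior that handling the bands separately is meant to encourage.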