MCF-YOLO: Consistency-Guided Cross-Modal Attention for Small-Object RGB-IR Detection
Xiang Yang, Mengyue Yang, Xiaolan Xie
In low-light, occluded, and cluttered environments, single-modality RGB detectors are prone to false positives and missed detections. While infrared (IR) imaging provides relatively stable target visibility under poor illumination, it lacks texture and color information and is susceptible to background thermal noise and imaging variations. To address these limitations, this paper proposes an RGB–IR object detection network, named MCF-YOLO, consisting of three core components. First, the Cross-Modal Hierarchical Fusion (CMHF) module performs stage-wise alignment and fusion on multi-scale features, jointly modeling RGB texture details and IR thermal responses to exploit the structural and semantic complementarity between the two modalities. Second, the Soft Attention Regularization based on Attention Prior (SAR-AP) module derives attention priors from IR features to impose soft constraints on cross-modal attention maps. This mechanism helps the network maintain attention on target-relevant regions, thereby suppressing attention drift caused by low-light noise and complex backgrounds. Third, the Small-Object-Sensitive Detection Head (SOS-Head) processes high-resolution features to strengthen the representation of small targets, improving detection capability in long-range and occluded scenarios. In evaluations on two RGB–IR benchmarks, M3FD and VEDAI, MCF-YOLO achieves improvements of 2.7% in mAP@0.5 and 1.1% in mAP@0.5:0.95 on M3FD, and 5.4% and 4.4%, respectively, on VEDAI. These results suggest that consistency-guided cross-modal fusion and high-resolution small-target modeling are beneficial for RGB–IR detection in low-visibility and cluttered scenes.
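
To make the idea of a soft attention constraint concrete, the following is a minimal NumPy sketch of one way an IR-derived prior could regularize an attention map. It is not the SAR-AP formulation itself, which the abstract does not specify. The assumptions here are that the prior is the channel-averaged magnitude of the IR features, normalized with a spatial softmax, and that the penalty is the KL divergence from the prior to the attention map. All function names and the `temperature` parameter are hypothetical.

```python
import numpy as np


def spatial_softmax(scores, temperature=1.0):
    """Turn an (H, W) score map into a probability distribution over positions."""
    z = scores.reshape(-1) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return (p / p.sum()).reshape(scores.shape)


def ir_attention_prior(ir_feat, temperature=1.0):
    """Hypothetical prior: channel-averaged |activation| of IR features (C, H, W)."""
    energy = np.abs(ir_feat).mean(axis=0)  # (H, W) thermal response strength
    return spatial_softmax(energy, temperature)


def soft_attention_penalty(attn, prior, eps=1e-8):
    """KL(prior || attn): zero when attn matches the prior, larger as attn drifts."""
    attn = attn / attn.sum()  # make sure the attention map sums to 1
    return float(np.sum(prior * (np.log(prior + eps) - np.log(attn + eps))))


# Toy IR feature map: weak background noise plus one hot target at (row 1, col 2).
rng = np.random.default_rng(0)
ir_feat = rng.normal(scale=0.1, size=(8, 4, 4))
ir_feat[:, 1, 2] += 3.0

prior = ir_attention_prior(ir_feat)
focused_attn = prior.copy()              # attention that agrees with the prior
drifted_attn = np.full((4, 4), 1 / 16)   # attention spread evenly over the scene

print("penalty (focused):", soft_attention_penalty(focused_attn, prior))
print("penalty (drifted):", soft_attention_penalty(drifted_attn, prior))
```

In a training loop, a penalty of this kind would typically be added to the detection loss with a small weight, so that it nudges the attention toward thermally salient regions without forcing it to copy the prior. That weighting is what makes the constraint soft rather than hard.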