DOI: 10.1049/ipr2.70422 ISSN: 1751-9659

Multi‐Attention Guided Feature Collaborate Network for RGB‐D Semantic Segmentation

Junhao Song, Chenxi Du, Chen Wang, Haisheng Li, Ruijun Liu, Yue Qi

ABSTRACT

Indoor RGB‐D semantic segmentation benefits from depth information for enhanced scene understanding, yet it faces challenges including modality discrepancies, noise interference and loss of fine‐grained details during feature extraction. Although cross‐modal fusion for RGB‐X semantic segmentation (CMX) enables deep cross‐modal correction and fusion, it remains limited in noise suppression, detailed feature handling and multiscale feature utilization. To address these limitations, we propose the multi‐attention guided feature collaborate network (MAGFCNet), an attention‐based feature coordination network built upon CMX. MAGFCNet introduces a three‐stage feature calibration module (FCM). Its first two stages suppress modality‐specific noise through progressive calibration, first channel‐wise and then jointly over channel and spatial dimensions; its third stage performs dynamic fusion for cross‐modal feature integration. A feature aggregation module (FAM) then combines efficient cross‐attention for global information exchange with a contextual feature fusion module (CFFM) for spatial‐domain adaptive fusion. CFFM leverages a global–local feature fusion module (G‐LFFM), a lightweight yet effective design inspired by existing attention mechanisms, to balance global context and local details through parallel branches. Furthermore, a top‐down decoder employs CFFM with dual‐path enhancement to progressively integrate multiscale features while recovering structural boundaries. Extensive experiments on NYU Depth v2 and SUN‐RGBD demonstrate that MAGFCNet achieves consistent improvements over CMX across multiple settings. With a MiT‐B5 backbone, MAGFCNet improves mIoU by 0.19% and PA by 0.23% on NYU Depth v2, and with a MiT‐B2 backbone, it improves mIoU by 0.45% on SUN‐RGBD. Qualitative results further show that MAGFCNet provides more robust performance in reflective, occluded and structurally ambiguous regions, demonstrating enhanced boundary preservation and improved contextual reasoning.
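
For readers unfamiliar with the two metrics reported above, the following minimal sketch shows how mean intersection over union (mIoU) and pixel accuracy (PA) are conventionally computed from a confusion matrix. It is not the authors' evaluation code. The function names, the flattened-list input format and the ignore label of 255 are assumptions made for illustration, and the exact evaluation protocol used in the paper may differ.

```python
# Illustrative sketch only: conventional mIoU and pixel accuracy (PA)
# computed from a confusion matrix. All names here are hypothetical.

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Count pixels by (true class, predicted class).

    pred and target are flat sequences of integer class labels.
    Pixels whose target equals ignore_index (an assumed convention)
    are skipped.
    """
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of counted pixels whose predicted class is correct."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def mean_iou(cm):
    """Average of per-class IoU = TP / (TP + FP + FN).

    Classes that appear in neither prediction nor target are skipped.
    """
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        fn = sum(cm[c]) - tp                       # true c, predicted class differs
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 255]  # last pixel is ignored
    cm = confusion_matrix(pred, target, num_classes=3)
    print(f"PA = {pixel_accuracy(cm):.4f}, mIoU = {mean_iou(cm):.4f}")
```

Because mIoU averages over classes while PA averages over pixels, a method can gain on one metric without gaining equally on the other, which is why both are reported.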
