DOI: 10.3390/app15010438 ISSN: 2076-3417

Adaptive Multimodal Fusion with Cross-Attention for Robust Scene Segmentation and Urban Economic Analysis

Chun Zhong, Shihong Zeng, Hongqiu Zhu

As demand grows for accurate multimodal data analysis in complex scenarios, existing models often struggle to capture and fuse information across diverse modalities, especially when the data span varying scales and levels of detail. To address these challenges, this study presents an enhanced Swin Transformer V2-based model designed for robust multimodal data processing. The method analyzes urban economic activity and spatial layout from satellite and street-view images, with applications such as estimating traffic flow and business activity intensity, underscoring its practical significance. The model incorporates a multi-scale feature extraction module into the window attention mechanism, combining local and global window attention with adaptive pooling to achieve comprehensive multi-scale feature fusion and representation. This design enables the model to capture information at different scales, enhancing its expressiveness in complex scenes. In addition, a cross-attention-based multimodal fusion mechanism integrates spatial structure information from scene graphs with the Swin Transformer's image classification outputs. By computing similarities and correlations between scene-graph embeddings and image classification features, the mechanism dynamically adjusts each modality's contribution to the fused representation, leveraging complementary information for a more coherent multimodal understanding. Compared with the baseline method, the proposed bimodal model achieves superior performance, improving accuracy by 3% to 91.5%, demonstrating its effectiveness in processing and fusing multimodal information. These results highlight the advantage of combining multi-scale feature extraction and cross-modal alignment to improve performance on complex multimodal tasks.
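To make the described pipeline more concrete, the sketch below illustrates the two main ideas from the abstract: multi-scale pooling of image features (standing in for the paper's combined local/global window attention with adaptive pooling) and cross-attention fusion of scene-graph embeddings with image tokens, gated by a learned similarity-based weight. This is an illustrative PyTorch sketch, not the authors' released code; all module names, dimensions, and the specific gating formulation are assumptions made for demonstration.

```python
# Illustrative sketch only: hypothetical module names and dimensions,
# not the implementation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScalePooling(nn.Module):
    """Pools backbone feature maps at several scales and projects them to a
    shared embedding dimension (stand-in for the multi-scale window-attention
    branch described in the abstract)."""

    def __init__(self, in_dim: int, embed_dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) feature map from the image backbone
        tokens = []
        for s in self.scales:
            pooled = F.adaptive_avg_pool2d(feat, s)           # (B, C, s, s)
            tokens.append(pooled.flatten(2).transpose(1, 2))  # (B, s*s, C)
        tokens = torch.cat(tokens, dim=1)                     # (B, N, C)
        return self.proj(tokens)                              # (B, N, D)


class CrossModalFusion(nn.Module):
    """Cross-attention with scene-graph embeddings as queries and multi-scale
    image tokens as keys/values, followed by a learned gate that weights each
    modality's contribution to the fused representation."""

    def __init__(self, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                                batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim),
                                  nn.Sigmoid())

    def forward(self, graph_emb: torch.Tensor,
                image_tokens: torch.Tensor) -> torch.Tensor:
        # graph_emb:    (B, M, D) scene-graph node embeddings
        # image_tokens: (B, N, D) multi-scale image tokens
        attended, _ = self.cross_attn(graph_emb, image_tokens, image_tokens)
        g = graph_emb.mean(dim=1)        # pooled graph representation (B, D)
        v = attended.mean(dim=1)         # pooled attended image view  (B, D)
        alpha = self.gate(torch.cat([g, v], dim=-1))  # modality weight (B, D)
        return alpha * g + (1.0 - alpha) * v          # fused feature for the head


# Hypothetical usage: fuse a 7x7 backbone feature map with 16 graph nodes,
# then feed the fused vector to a downstream classification head.
pool = MultiScalePooling(in_dim=768, embed_dim=256)
fusion = CrossModalFusion(embed_dim=256)
image_tokens = pool(torch.randn(2, 768, 7, 7))
fused = fusion(torch.randn(2, 16, 256), image_tokens)  # (2, 256)
```

The sigmoid gate here is one simple way to realize the abstract's "dynamically adjusts each modality's contribution"; the paper's actual weighting may differ in form.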
