EDM-Net: A Multi-Scale Network for Object Detection in Remote Sensing Images
Shuai Liang, Xiao Wang, Jialong Sun, Hui Liu, Huilei Yang

Remote sensing object detection remains challenging because objects often appear with large scale variation, dense spatial layouts, and strong interference from complex geographical backgrounds. To address these coupled difficulties, we propose EDM-Net, an end-to-end multi-scale detector that organizes feature processing into three coordinated stages: adaptive extraction, intra-scale interaction, and cross-scale fusion. First, an efficient sparse mixture-of-experts (ES-MoE) module is embedded in the backbone to allocate scale-specific convolutional experts according to scene-level feature responses, providing a more adaptive feature basis than a single static extraction path. Second, a dynamic mixing intra-scale feature interaction (DMIFI) module is introduced into the Transformer encoder. This module combines global self-attention with dynamic spatial mixing, thereby preserving long-range context while reintroducing local two-dimensional inductive bias for dense and small objects. Third, a multi-scale synergistic attention fusion (MSAF) module aligns adjacent feature levels through parallel local and global attention branches and structural re-parameterization, reducing semantic dilution during feature aggregation. Comprehensive experiments on three large-scale remote sensing benchmark datasets, DIOR, NWPU VHR-10, and RSOD, demonstrate that EDM-Net consistently improves over the re-trained RT-DETR-R18 baseline under the same experimental protocol, attaining mAP50 scores of 83.7%, 95.6%, and 95.8% respectively. Additional ablation and scale-specific analyses indicate that the three modules contribute complementary gains, especially for small and densely distributed objects. These results suggest that coordinated extraction, interaction, and fusion can improve remote sensing object detection under complex scale and background conditions.
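The abstract does not specify how ES-MoE routes features internally, so the following is only a minimal sketch of generic sparse top-k gating, written to illustrate the idea of selecting scale-specific experts from a scene-level response. Everything in it is an assumption made for illustration: the function names (`global_average_pool`, `top_k_gate`, `box_filter`, `sparse_moe`), the use of global average pooling as the scene-level descriptor, the linear gate, and the mean filters of different sizes that stand in for learned convolutional experts with different receptive fields.

```python
import math


def global_average_pool(feature_map):
    """Collapse a C x H x W nested list into a length-C scene descriptor."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(descriptor, gate_weights, k):
    """Score every expert with a linear gate (assumed), keep the k best,
    and renormalize their weights so they sum to 1."""
    logits = [sum(w * d for w, d in zip(row, descriptor)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))


def box_filter(size):
    """Hypothetical stand-in for a convolutional expert: a size x size mean
    filter with zero padding. Larger sizes mimic larger receptive fields."""
    radius = size // 2

    def apply(channel):
        h, w = len(channel), len(channel[0])
        out = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                acc = 0.0
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += channel[yy][xx]
                out[y][x] = acc / (size * size)
        return out

    return apply


def sparse_moe(feature_map, experts, gate_weights, k=2):
    """Run only the k selected experts and sum their weighted outputs.
    Returns the mixed feature map and the routing {expert_index: weight}."""
    descriptor = global_average_pool(feature_map)
    routing = top_k_gate(descriptor, gate_weights, k)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    mixed = [[[0.0] * w for _ in range(h)] for _ in feature_map]
    for index, weight in routing.items():
        for c, channel in enumerate(feature_map):
            result = experts[index](channel)
            for y in range(h):
                for x in range(w):
                    mixed[c][y][x] += weight * result[y][x]
    return mixed, routing
```

Calling `sparse_moe(feature_map, [box_filter(1), box_filter(3), box_filter(5)], gate_weights, k=2)` evaluates only the two highest-scoring experts for that input, which is the source of the efficiency usually attributed to sparse mixtures: the unselected experts cost nothing at inference time. In a trained detector the gate and the experts would be learned jointly; none of that training logic is shown here.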