LSCA-RCNN: Large-Kernel Spatial Residual and Cascade Attention Network for Voxel-Based 3D Object Detection
Yuyang Liu, Zhanyuan Jiang, Min Mao, Kun Zhang, Yu Xu, Mingchen Zhu, Xianjun Wu

LiDAR-based 3D object detection remains challenging because point clouds are sparse and irregularly distributed, which degrades detection accuracy for small and occluded objects. To address these issues, this paper proposes LSCA-RCNN, a novel two-stage voxel-based 3D detector. First, spatial residual blocks (SRBs) and large-kernel spatial-wise convolutions are integrated into the 3D backbone to suppress feature degradation and expand the receptive field for stable multi-scale feature learning. Second, a ConvNeXt-based 2D backbone with spatial attention is constructed to enhance the discriminative feature representation of small objects. Third, a cascaded detection head with fine-grained grouped convolutions and cross-stage cross-attention is designed to refine bounding boxes progressively and improve localization precision. Extensive evaluations on the KITTI dataset with the R40 metric show that the proposed method consistently improves on the baseline. In the moderate setting, LSCA-RCNN increases the 3D AP by 2.12%, 7.66%, and 5.43% for cars, pedestrians, and cyclists, respectively; in the hard setting, the gains are 1.62%, 5.05%, and 7.05%. These results validate the effectiveness and robustness of LSCA-RCNN for complex and challenging autonomous driving detection tasks.
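
For readers unfamiliar with the R40 metric used for the reported 3D AP values, the following is a minimal sketch of how an average precision over 40 recall positions is commonly computed: the interpolated precision is averaged over the recall levels 1/40, 2/40, ..., 1. The function name `ap_r40` and its inputs are hypothetical and chosen for illustration only; this is not the authors' evaluation code, and it assumes the precision-recall points have already been obtained. The IoU-based matching and the difficulty filtering of the full KITTI protocol are omitted.

```python
def ap_r40(recalls, precisions):
    """Average precision over 40 recall positions (1/40, 2/40, ..., 1).

    recalls, precisions: equal-length sequences describing points on a
    precision-recall curve (hypothetical inputs for this sketch).
    """
    if len(recalls) != len(precisions):
        raise ValueError("recalls and precisions must have the same length")

    points = list(zip(recalls, precisions))
    total = 0.0
    for i in range(1, 41):
        r = i / 40
        # Interpolated precision: the best precision achieved at any
        # recall >= r; zero if the detector never reaches recall r.
        reachable = [p for rec, p in points if rec >= r - 1e-9]
        total += max(reachable, default=0.0)
    return total / 40


# Usage: a detector with precision 1.0 that only reaches recall 0.5
# scores on half of the 40 recall positions.
half = ap_r40([0.5], [1.0])
```

Because each of the 40 positions contributes equally, a detector that loses recall on hard (small or occluded) objects is penalized at every recall level it fails to reach, which is why gains in the moderate and hard settings are reported separately.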