DOI: 10.1177/01423312241308367 ISSN: 0142-3312

BEV transformer for visual 3D object detection applied with retentive mechanism

Jincheng Pan, Xiaoci Huang, Suyun Luo, Fang Ma

Three-dimensional (3D) vision perception tasks that use multiple cameras are pivotal for autonomous driving systems, encompassing both 3D object detection and map segmentation. We introduce a novel approach, dubbed RetentiveBEV, that leverages a Transformer to learn spatiotemporal features from the Bird's Eye View (BEV) perspective. These BEV representations form the foundation for downstream autonomous driving tasks. In brief, spatial features within regions of interest (ROIs) are extracted via spatial cross-attention, while temporal dynamics are integrated via temporal self-attention, enriching the BEV with historical data. We enhance spatial cross-attention with a retentive mechanism that prioritizes information around the focal points and allows the attention computation to be decomposed, improving computational efficiency. On the nuScenes test split, our approach achieves a nuScenes Detection Score (NDS) of 60.4% without additional training data, 8.7 percentage points above the baseline (BEVFormer-base) and close to the state-of-the-art method SparseBEV, which reports an NDS of 65.7% as of August 2024. On the nuScenes validation split, our method reaches 55.8 NDS while maintaining a real-time inference speed of 25.3 FPS, and we are working to accelerate inference further with TensorRT (mAP and NDS are specified in equations (12) and (13) of the paper). Integrating the retentive mechanism notably improves precision and recall in 3D object detection while also speeding up inference.
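Equations (12) and (13) appear only in the full paper; for orientation, the standard nuScenes definition of NDS, which those equations presumably follow, combines mAP with five true-positive error metrics, TP = {mATE, mASE, mAOE, mAVE, mAAE}:

$$ \mathrm{NDS} = \tfrac{1}{10}\Big[\, 5\,\mathrm{mAP} \;+\; \sum_{\mathrm{mTP}\in\mathbb{TP}} \big(1 - \min(1, \mathrm{mTP})\big) \Big] $$

The abstract gives no implementation details for the retentive spatial cross-attention, so the following is a minimal, hypothetical PyTorch sketch of one way a retention-style exponential decay could modulate attention scores around each query's BEV reference point. Every name here (retentive_cross_attention, gamma, ref_xy, key_xy) is illustrative, not the authors' API.

import torch

def retentive_cross_attention(q, k, v, ref_xy, key_xy, gamma=0.9):
    # q: (Nq, d) BEV queries; k, v: (Nk, d) keys/values lifted to the BEV plane
    # ref_xy: (Nq, 2) focal (reference) points; key_xy: (Nk, 2) key coordinates
    # gamma in (0, 1): decay base; smaller values focus harder on nearby keys
    d = q.shape[-1]
    scores = (q @ k.t()) / d ** 0.5             # (Nq, Nk) scaled dot-product logits
    dist = torch.cdist(ref_xy, key_xy)          # (Nq, Nk) BEV distance to each key
    decay = gamma ** dist                       # retention-style exponential decay
    attn = torch.softmax(scores, dim=-1) * decay    # prioritize keys near the focal point
    attn = attn / attn.sum(dim=-1, keepdim=True)    # renormalize after decay
    return attn @ v                             # (Nq, d) retention-weighted features

# Toy usage with random tensors:
torch.manual_seed(0)
q, k, v = torch.randn(4, 32), torch.randn(100, 32), torch.randn(100, 32)
out = retentive_cross_attention(q, k, v, torch.rand(4, 2) * 50, torch.rand(100, 2) * 50)
print(out.shape)  # torch.Size([4, 32])

Because the decay depends only on geometry, not on the query-key dot products, it can be precomputed once per BEV grid, which is one plausible reading of the decomposition that the abstract credits for the efficiency gain.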
