Do It Once: Concatenating the Image Pair for a Single Pass Feature Extraction in Stereo Depth Sensing

DOI: 10.3390/s26123919 ISSN: 1424-8220

Žan Regoršek, Andrej Žemva

In stereo depth sensing, modern research predominantly prioritizes accuracy, yet inference speed remains a critical bottleneck for practical, real-time applications on resource-constrained platforms. Existing acceleration approaches often rely on lighter network architectures or runtime-specific optimizations, which may require architectural redesign, platform-specific tuning, or accuracy trade-offs. However, a common inefficiency remains in many stereo pipelines: feature extraction is typically performed in two separate forward passes, one for the left image and one for the right, even though both passes use the same network weights. We address this redundancy by concatenating the left and right images into a single combined tensor, enabling feature extraction in one batched pass while preserving the original network architecture. Our results show that the method reduces feature extraction time by up to 48.4%, which increases the overall inference rate by 10% to 39% on average on an NVIDIA V100 and by up to 28.4% on an edge device, depending on the model architecture. This speedup comes at the cost of only a moderate increase in runtime memory consumption, and the original accuracy is retained. Because the method does not alter the core stereo network, it can be applied as a plug-and-play enhancement to both existing and newly developed stereo matching models.
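The single-pass idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the "single combined tensor" is formed by stacking the two views along the batch axis, and it uses a toy shared-weight extractor (a 1x1 convolution plus ReLU) in place of a real backbone. The function names `extract_features`, `extract_two_pass`, and `extract_single_pass` are hypothetical.

```python
import numpy as np

def extract_features(images, weights):
    """Toy shared-weight extractor (assumption): 1x1 convolution + ReLU.

    images:  (N, C_in, H, W) batch of images
    weights: (C_out, C_in) kernel shared by every image in the batch
    """
    features = np.einsum("oc,nchw->nohw", weights, images)
    return np.maximum(features, 0.0)

def extract_two_pass(left, right, weights):
    # Baseline: one forward pass per view, same weights both times.
    return extract_features(left, weights), extract_features(right, weights)

def extract_single_pass(left, right, weights):
    # Stack both views along the batch axis and run the extractor once.
    pair = np.concatenate([left, right], axis=0)  # (2N, C_in, H, W)
    features = extract_features(pair, weights)
    n = left.shape[0]
    # Split the result back into left and right feature maps.
    return features[:n], features[n:]

rng = np.random.default_rng(0)
left = rng.standard_normal((1, 3, 8, 16))   # hypothetical left image
right = rng.standard_normal((1, 3, 8, 16))  # hypothetical right image
weights = rng.standard_normal((4, 3))       # hypothetical shared weights

f_left_2p, f_right_2p = extract_two_pass(left, right, weights)
f_left_1p, f_right_1p = extract_single_pass(left, right, weights)
```

In this sketch the two paths agree because every operation in the toy extractor treats batch entries independently. A real backbone with layers that compute statistics across the batch (for example, batch normalization in training mode) would need separate care. The sketch shows only that the outputs match; it does not measure speed or memory.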
