Do It Once: Concatenating the Image Pair for a Single Pass Feature Extraction in Stereo Depth Sensing
Žan Regoršek, Andrej Žemva

In the field of stereo depth sensing, modern research predominantly prioritizes accuracy, yet inference speed remains a critical bottleneck for practical, real-time applications on resource-constrained platforms. Existing acceleration approaches often rely on lighter network architectures or runtime-specific optimizations, which may require architectural redesign, platform-specific tuning, or accuracy trade-offs. However, a common inefficiency remains in many stereo pipelines: feature extraction is typically performed using two separate forward passes, one for the left image and one for the right, even though both passes use the same network weights. We address this redundancy by concatenating the left and right images into a single combined tensor, enabling feature extraction in one batched pass while preserving the original network architecture. Our results show that the method reduces feature extraction time by up to 48.4%, which accelerates overall inference by 10% to 39% on average on an Nvidia V100 and by up to 28.4% on an edge device, depending on the model architecture. This speedup comes at the expense of only a moderate increase in runtime memory consumption, while the original accuracy is retained. Because the method does not alter the core stereo network, it can be applied as a plug-and-play enhancement to both existing and newly developed stereo matching models.
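The single-pass idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the "single combined tensor" is formed by stacking the two images along the batch axis, and it replaces a real feature network with a hypothetical toy extractor (`extract_features`, a 1x1 convolution followed by ReLU). All names, shapes, and weights below are illustrative assumptions.

```python
import numpy as np


def extract_features(images, weights):
    """Hypothetical toy extractor with shared weights: 1x1 convolution + ReLU.

    images:  array of shape (N, C_in, H, W)
    weights: array of shape (C_out, C_in)
    returns: array of shape (N, C_out, H, W)
    """
    # Mix input channels into output channels independently at every pixel.
    mixed = np.einsum("oc,nchw->nohw", weights, images)
    return np.maximum(mixed, 0.0)


rng = np.random.default_rng(0)
weights = rng.standard_normal((8, 3))      # assumed: 3 input, 8 output channels
left = rng.standard_normal((1, 3, 4, 6))   # assumed: one small left image
right = rng.standard_normal((1, 3, 4, 6))  # assumed: one small right image

# Baseline: two separate forward passes that reuse the same weights.
feat_left_ref = extract_features(left, weights)
feat_right_ref = extract_features(right, weights)

# Single pass: stack the pair along the batch axis, extract once, split again.
pair = np.concatenate([left, right], axis=0)             # shape (2, 3, 4, 6)
feat_pair = extract_features(pair, weights)              # shape (2, 8, 4, 6)
feat_left, feat_right = np.split(feat_pair, 2, axis=0)   # two (1, 8, 4, 6) arrays

# For an extractor that treats batch elements independently, both routes agree.
same = (np.allclose(feat_left, feat_left_ref)
        and np.allclose(feat_right, feat_right_ref))
```

The equivalence in this sketch holds because the toy extractor processes each batch element independently. A layer that computes statistics across the batch, such as batch normalization in training mode, would break it, so the comparison is meaningful only for inference-mode networks.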