Gaze Estimation Based on Visual State Space Model with Hybrid Features
Yujie Li, Rongjie Liu, Zhizun Zeng, Ziwen Wang, Yuhang Hong, Benying Tan

The Visual State Space Model (VMamba), a vision model proposed to bring Mamba into computer vision, has shown strong performance in recent work on vision tasks. However, its performance in gaze estimation remains to be explored. In this paper, we propose two VMamba-based gaze estimation approaches: GazeVM-Pure, based on pure VMamba, and GazeVM-Hybrid, based on a hybrid VMamba. GazeVM-Pure estimates gaze direction using the original VMamba architecture. GazeVM-Hybrid combines a Convolutional Neural Network (CNN) with VMamba, employing the Visual State Space (VSS) Block (the core module of VMamba) as a complementary component to the CNN. In GazeVM-Hybrid, the convolutional layers of ResNet-34 learn local feature maps from face images, and the VSS Block captures global relations across those feature maps. Experimental results show that GazeVM-Hybrid outperforms existing state-of-the-art techniques, with an angle error nearly 0.11 lower than that of the Static Transformer Temporal Differential Network (STTDN) on the EyeDiap dataset.
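To make the hybrid design concrete, the following is a minimal PyTorch sketch of the GazeVM-Hybrid pipeline as described above: the convolutional stages of ResNet-34 produce local feature maps, a global-mixing block stands in for the VSS Block, and a linear head regresses the two gaze angles. The class names, the simplified `GlobalMixingBlock` (the actual VSS Block uses a 2D selective scan, which is omitted here), and the input/output sizes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34


class GlobalMixingBlock(nn.Module):
    """Hypothetical stand-in for the VSS Block: captures global relations over
    the feature map with a simple token-mixing layer (NOT the actual selective scan)."""

    def __init__(self, channels: int, hw: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.mix = nn.Linear(hw, hw)  # mixes information across all spatial positions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2)                  # (B, C, H*W)
        tokens = tokens + self.mix(tokens)     # global spatial mixing with a residual connection
        tokens = self.norm(tokens.transpose(1, 2)).transpose(1, 2)
        return tokens.view(b, c, h, w)


class GazeVMHybridSketch(nn.Module):
    """Sketch of the hybrid pipeline: ResNet-34 conv layers -> global block -> gaze head."""

    def __init__(self):
        super().__init__()
        backbone = resnet34(weights=None)
        # Keep only the convolutional stages of ResNet-34 (drop avgpool and fc).
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, 7, 7) for 224x224 input
        self.global_block = GlobalMixingBlock(channels=512, hw=7 * 7)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(512, 2)          # regress the 2D gaze direction (e.g., pitch and yaw)

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(face)                 # local features from the CNN
        feats = self.global_block(feats)       # global relations (role of the VSS Block in the paper)
        return self.head(self.pool(feats).flatten(1))


if __name__ == "__main__":
    model = GazeVMHybridSketch()
    gaze = model(torch.randn(2, 3, 224, 224))  # two 224x224 face crops
    print(gaze.shape)                          # torch.Size([2, 2])
```

The design choice mirrored here is the division of labor stated in the abstract: convolutions extract local features cheaply, while a single global module relates distant regions of the face before the regression head.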