Self-Supervised Bidirectional State Space Modeling for Voiceprint Feature Representation and Recognition
Junju Lai, Wei Wang, Guangyao Li, Zhichong Kong, Chao Yuan, Qian Zhou

As substation equipment evolves toward higher voltage levels, larger capacities, and more complex operating conditions, voiceprint signals become an increasingly sensitive and observable indicator of faults in their early stages. However, traditional modeling approaches remain limited in capturing long-range temporal dependencies, suppressing noise interference, and exploiting unlabeled data. To address these issues, a self-supervised voiceprint framework built on the Mamba state space model, termed MSANet, is proposed. A bidirectional state space scanning mechanism is introduced into the network architecture to avoid the high computational complexity of attention mechanisms while preserving both the global contextual correlations and the local details of voiceprint signals. In addition, a self-supervised learning strategy based on spectrum block masking is incorporated, enabling the model to extract stable time–frequency structural features even when labeled samples are scarce or unavailable. Experimental results demonstrate that MSANet achieves high accuracy in voiceprint-related tasks. Furthermore, a lightweight version of the model maintains competitive performance while significantly reducing computational and storage overhead, indicating its feasibility for deployment on edge devices in resource-constrained settings such as substations. The proposed method provides a potential methodological basis for fault-related voiceprint feature extraction, representation learning, and future practical engineering deployment.
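
The two mechanisms named above can be illustrated with a minimal NumPy sketch. It is not the MSANet implementation: the function names, block sizes, and the fixed scalar parameters `a` and `b` are assumptions chosen for illustration, and the recurrence is a plain linear state update rather than Mamba's input-dependent (selective) one. The first function hides random rectangular time-frequency blocks of a spectrogram, as a block-masking pretext task might; the second runs the same recurrence forward and backward in time and sums the results, so each frame sees context from both directions at linear cost in sequence length.

```python
import numpy as np


def mask_spectrogram_blocks(spec, num_blocks=4, block_shape=(8, 8), rng=None):
    """Zero out random rectangular blocks of a (freq, time) spectrogram.

    Returns the masked copy and the boolean mask of hidden positions.
    Block count and size are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_freq, n_time = spec.shape
    bf = min(block_shape[0], n_freq)
    bt = min(block_shape[1], n_time)
    mask = np.zeros(spec.shape, dtype=bool)
    for _ in range(num_blocks):
        f0 = rng.integers(0, n_freq - bf + 1)  # top edge of the block
        t0 = rng.integers(0, n_time - bt + 1)  # left edge of the block
        mask[f0:f0 + bf, t0:t0 + bt] = True
    return np.where(mask, 0.0, spec), mask


def linear_scan(x, a, b):
    """Compute h_t = a * h_{t-1} + b * x_t along axis 0, starting from h = 0."""
    x = np.asarray(x, dtype=float)
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out


def bidirectional_scan(x, a=0.9, b=0.1):
    """Sum a forward scan and a time-reversed scan of a (time, features) array."""
    x = np.asarray(x, dtype=float)
    forward = linear_scan(x, a, b)
    backward = linear_scan(x[::-1], a, b)[::-1]
    return forward + backward
```

In a pretraining loop, a reconstruction loss would typically be evaluated only at the positions where `mask` is true; that objective is an assumption here, since the abstract does not specify it.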