WearSE: Enabling Streaming Speech Enhancement on Eyewear Using Acoustic Sensing
Qian Zhang, Kaiyi Guo, Yifei Yang, Dong Wang

Smart eyewear has rapidly evolved in recent years, yet its mobile, in-the-wild use often exposes voice interaction on such devices to external interference. In this paper, we introduce WearSE, a system that uses acoustic signals emitted and received by the speakers and microphones mounted on eyewear to perceive facial movements during speech, enabling multimodal speech enhancement. WearSE incorporates three key designs to meet the high demands for real-time operation and robustness on smart eyewear. First, considering the frequent use in mobile scenarios, we design a sensing-enhanced network that amplifies the acoustic sensing capability and eliminates dynamic multipath interference. Second, we develop a lightweight speech enhancement network that enhances both the amplitude and the phase of the speech spectrum; its causal network design significantly reduces computational demands, ensuring real-time operation on mobile devices. Third, to address the scarcity of paired data, we design a memory-based back-translation mechanism that generates pseudo acoustic-sensing data from large amounts of publicly available speech data for network training. We construct a prototype system and evaluate WearSE extensively through experiments. In multi-speaker scenarios, our approach substantially outperforms audio-only speech enhancement methods. Comparisons with commercial smart eyewear also show that WearSE significantly surpasses the noise reduction algorithms built into those devices. An audio demo of WearSE is available at https://github.com/WearSE/wearse.github.io.
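To illustrate the causal (streaming) network design mentioned above, the sketch below shows a causal 1-D convolution block in PyTorch: each output frame depends only on the current and past input frames, so inference can proceed frame by frame without look-ahead latency. This is a generic illustration under our own assumptions; the class name CausalConv1d, the channel counts, and the kernel size are hypothetical and do not reflect WearSE's actual architecture.

```python
# Minimal sketch of a causal 1-D convolution block (illustrative only,
# not WearSE's architecture). Left-only padding keeps the receptive field
# strictly in the past, which is the property a streaming enhancement
# network needs for real-time, frame-by-frame operation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        # Pad only on the left so the convolution never sees future frames.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)


if __name__ == "__main__":
    block = CausalConv1d(in_ch=2, out_ch=16, kernel_size=3)
    frames = torch.randn(1, 2, 100)   # e.g. 100 spectrogram frames
    out = block(frames)
    print(out.shape)                  # torch.Size([1, 16, 100])
```

Because the receptive field is restricted to past frames, the per-frame computation stays constant during streaming, which is the property a causal design exploits to keep latency and compute low on mobile devices.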