DOI: 10.3390/s26134175 ISSN: 1424-8220

FAV-DenoiseNet: An Audio–Visual Speech Enhancement Framework Based on Conditional Flow Matching and Visual Encoding

Xuan Fu, Lulu Qin, Weijing Liu, Mingchen Sun, Dadong Wang

Audio–visual speech enhancement aims to recover clean speech by jointly using noisy acoustic signals and synchronized visual cues. Although diffusion-based methods achieve promising restoration performance, their multi-step sampling causes high inference latency and computational cost, limiting real-time deployment. To address this issue, this paper proposes FAV-DenoiseNet, a two-stage framework based on discriminative prior denoising and conditional residual flow matching. The first stage uses a pre-trained discriminative denoising network to suppress dominant noise and provide a structurally stable speech prior. The second stage reformulates enhancement as residual compensation between the first-stage output and the clean speech spectrum instead of directly predicting the entire clean spectrum. A conditional flow-matching network estimates the residual from zero-residual initialization through single-step inference, reducing generative sampling cost. Multi-scale cross-modal attention provides adaptive visual guidance for audio refinement at different resolutions. A residual-controlled fusion strategy preserves the stable structure recovered by the first stage while compensating for residual noise, high-frequency details, and weak speech components. The experimental results show that FAV-DenoiseNet achieves PESQ, ESTOI, and SI-SDR scores of 2.805, 0.775, and 12.480 dB on VoxCeleb2 and 3.157, 0.876, and 13.281 dB on GRID, respectively, with an RTF of 0.086. These results demonstrate that the proposed framework effectively balances enhancement quality, detail restoration, and real-time inference efficiency.
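The second-stage idea in the abstract (zero-residual initialization, one inference step, then residual-controlled fusion with the first-stage prior) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `toy_velocity` is a hand-written stand-in for the learned conditional flow-matching network, the single step is a plain Euler step from t = 0 to t = 1 along an assumed straight-line path, and `alpha` is a hypothetical fusion weight. The real-time factor is computed with the usual definition (processing time divided by audio duration), which may differ in detail from the paper's measurement protocol.

```python
import numpy as np


def toy_velocity(residual, t, prior_spec, visual_feat):
    """Hypothetical stand-in for the learned conditional velocity field.

    A trained network would predict this from the residual state, the time t,
    the first-stage spectrum, and the visual features. Here a fixed toy rule
    is used so the sketch runs without any model weights.
    """
    # Scalar gate in (0, 1) derived from the visual features (toy choice).
    gate = 1.0 / (1.0 + np.exp(-visual_feat.mean()))
    # Toy "true" residual: a small, visually gated fraction of the prior.
    target = 0.1 * gate * prior_spec
    # Straight-line path velocity pointing from the current state to the target.
    return target - residual


def single_step_residual(prior_spec, visual_feat, velocity_fn=toy_velocity):
    """Estimate the residual with one Euler step starting from zero."""
    r0 = np.zeros_like(prior_spec)           # zero-residual initialization
    v = velocity_fn(r0, 0.0, prior_spec, visual_feat)
    return r0 + 1.0 * v                      # step size 1: t = 0 -> t = 1


def fuse(prior_spec, residual, alpha=0.5):
    """Residual-controlled fusion: keep the prior, add a scaled correction."""
    return prior_spec + alpha * residual


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1 is faster than real time."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    prior = np.ones((4, 3))      # placeholder first-stage magnitude spectrum
    visual = np.zeros(8)         # placeholder visual embedding
    residual = single_step_residual(prior, visual)
    enhanced = fuse(prior, residual, alpha=0.5)
    print(enhanced.shape, real_time_factor(0.86, 10.0))
```

Because the residual starts at zero and is only added to the first-stage output with a bounded weight, the structure recovered by the discriminative stage is preserved even if the predicted correction is small or imperfect, which is the design motivation the abstract describes.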
