DOI: 10.3390/app16136420 ISSN: 2076-3417

Target Speaker Extraction with Cross-Correlation for Complex Spectra and Dual Post-Refinements

Sangwook Han, Seonggyu Lee, Jong Won Shin

Target speaker extraction (TSE) aims to isolate the speech of a target speaker from a mixture using speaker information in an enrollment utterance. Recently, several methods have been proposed that exploit the relationship between the enrollment utterance and the input mixture using cross-attention, without extracting speaker embeddings from the enrollment. Previous approaches applied cross-attention either to the encoded representations or separately to the real and imaginary parts of the compressed spectrograms, operations that may not have a physical meaning. In this paper, we propose a two-stage TSE method with a physically interpretable modified cross-attention block and a dual post-refinement structure. In the first stage, the attention weights used to fuse the enrollment and the mixture are derived from the cross-correlation between the complex spectra of the two signals, in a form analogous to the phase-sensitive mask. The fused features, along with the mixture features, are then fed into a speech extraction network to obtain a coarsely extracted target speech. The second stage consists of two parallel branches: one refines the first-stage output using the enrollment in a similar way to the first stage, and the other uses the mixture to complement target speech that may have been attenuated. In addition, low-dimensional speaker embeddings extracted from the enrollment and from the first-stage output are incorporated into the second stage to exploit speaker discriminability. Experimental results show that the proposed method consistently outperforms existing TSE methods on the Libri2Mix dataset under both clean and noisy conditions in terms of speech quality, speech intelligibility, and signal distortion measures.
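
To make the idea of correlation-derived attention weights concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. The abstract does not state the exact formula, so everything specific here is an assumption made for illustration: the function name `psm_like_attention`, the normalization by frame norms, and the softmax over enrollment frames. The score for each pair of frames is the real part of the complex inner product divided by the product of the frame norms, which is the cosine-of-phase-difference structure that the phase-sensitive mask also has.

```python
import math


def psm_like_attention(mix_frames, enr_frames, eps=1e-8):
    """Hypothetical sketch: fuse enrollment frames into each mixture frame.

    mix_frames, enr_frames: lists of frames, each frame a list of complex
    STFT bins with the same number of bins. Returns (weights, fused).
    """
    def norm(frame):
        # Euclidean norm of a complex frame
        return math.sqrt(sum(abs(x) ** 2 for x in frame))

    weights, fused = [], []
    for y in mix_frames:
        scores = []
        for e in enr_frames:
            # Real part of the complex cross-correlation over frequency bins
            corr = sum((a * b.conjugate()).real for a, b in zip(y, e))
            # Assumed normalization: bounds the score to roughly [-1, 1]
            scores.append(corr / (norm(y) * norm(e) + eps))
        # Assumed softmax over enrollment frames (shifted for stability)
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [v / total for v in exps]
        weights.append(row)
        # Weighted sum of enrollment frames, one fused frame per mixture frame
        fused.append([sum(w * e[f] for w, e in zip(row, enr_frames))
                      for f in range(len(y))])
    return weights, fused
```

With this scoring, an enrollment frame whose bins are in phase with a mixture frame receives a large weight, while one that is out of phase receives a small weight, which is the sense in which the weights stay tied to a physical quantity. A real system would operate on batched tensors and learned projections; those details are outside what the abstract specifies.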
