DOI: 10.3390/app16136613 ISSN: 2076-3417

Dual-Track Residual Framework for Residual Strength-Controlled Emotional Speech Synthesis

Youdong Ding, Yafan Geng, Wenjing Yu, Feifan Cai

Recent text-to-speech (TTS) systems can synthesize natural and intelligible speech, but adding controllable emotional expression to a pretrained model while preserving target-speaker identity remains challenging. This setting is especially constrained when the acoustic backbone is kept frozen and emotional adaptation relies on additional trainable modules. We study emotional adaptation for a frozen flow-matching TTS backbone and propose the dual-track residual framework (DTRF). The DTRF keeps the neutral-adapted Matcha-Base backbone unchanged, represents emotion as a neutral-anchor residual, and introduces two residual control paths: an asymmetric zero-initialized acoustic control branch for spectral vector field modulation and an emotional duration adapter (EDA) for duration-level prosody control. Rather than injecting emotion only into the acoustic path, the DTRF applies emotion control to both acoustic vector field prediction and phoneme duration prediction, jointly adjusting spectral realization and temporal prosody. A global neutral anchor converts absolute emotion embeddings into relative residuals so that the control signal describes the deviation from neutral speech toward the target emotion rather than an absolute style vector. During inference, a shared scalar factor α scales both residual paths, providing a practical residual strength interface for controllable emotion rendering. Moderate α values tend to increase emotional salience, whereas larger extrapolative values introduce trade-offs in naturalness, speaker similarity, and intelligibility. Experiments on the English subset of the Emotional Speech Dataset (ESD) show that the DTRF improves emotion-related metrics relative to the internal full-parameter updating and style token conditioning baselines, while maintaining a practical balance among speaker similarity, naturalness, and intelligibility. 
The emotion control modules contain approximately 27 M trainable parameters, corresponding to 29.61% of the full model parameters, which is a 70.39% reduction in trainable parameters compared with full-parameter updating. These results suggest that jointly modeling acoustic and duration residuals can be an effective strategy for adding residual strength-controlled emotional rendering to a frozen flow-matching TTS model without full-backbone updating.
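
The neutral-anchor residual and the shared strength factor α described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`neutral_anchor`, `emotion_residual`, `apply_residual`), the use of a plain mean as the global anchor, the additive form `neutral + α · residual`, and all numeric values are assumptions made for illustration; the abstract does not specify how the anchor is computed or exactly where α is applied inside each branch.

```python
from typing import List, Sequence

Vector = List[float]


def neutral_anchor(neutral_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Average neutral-speech embeddings into one global anchor.

    Assumption: the anchor is a plain mean over neutral embeddings.
    """
    count = len(neutral_embeddings)
    dim = len(neutral_embeddings[0])
    return [sum(emb[i] for emb in neutral_embeddings) / count for i in range(dim)]


def emotion_residual(emotion_embedding: Sequence[float], anchor: Sequence[float]) -> Vector:
    """Turn an absolute emotion embedding into a deviation from the neutral anchor."""
    return [e - a for e, a in zip(emotion_embedding, anchor)]


def apply_residual(neutral_output: Sequence[float],
                   residual_output: Sequence[float],
                   alpha: float) -> Vector:
    """Add a residual branch output, scaled by alpha, to the frozen neutral output.

    Assumption: control is additive, so alpha = 0 recovers the neutral prediction
    and alpha > 1 extrapolates beyond the trained residual.
    """
    return [n + alpha * r for n, r in zip(neutral_output, residual_output)]


# Toy embeddings (hypothetical values, two dimensions for readability).
anchor = neutral_anchor([[0.0, 2.0], [2.0, 4.0]])
residual = emotion_residual([2.0, 5.0], anchor)

# Hypothetical per-phoneme durations from the frozen backbone, plus a hypothetical
# output of a duration adapter that would be conditioned on `residual`.
neutral_durations = [4.0, 6.0]
duration_residual = [1.0, -2.0]

# The same alpha would scale the acoustic residual path as well; only the
# duration path is shown here.
for alpha in (0.0, 0.5, 1.0, 1.5):
    print(alpha, apply_residual(neutral_durations, duration_residual, alpha))
```

Under these assumptions, setting α to 0 leaves the frozen backbone's prediction untouched, which is consistent with describing emotion as a deviation from neutral speech, while values above 1 extrapolate past the learned residual, the regime where the abstract reports trade-offs in naturalness, speaker similarity, and intelligibility.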
