DOI: 10.1002/eng2.70892 ISSN: 2577-8196

Construction and Application of GAN Enhanced Virtual Interpretation Model for Sports Communication

Li Zhang, Hongyuan Xie, Jun Zhang, Ying Yang

ABSTRACT

Virtual commentators used in sports video broadcasting still suffer from unnatural facial expressions and poor synchronization between lip movements and audio, which prevents audiences from obtaining an authentic viewing experience. To improve the facial realism, lip‐audio synchronization, and expressive naturalness of the virtual narrator in sports video communication, this study proposes a virtual narrator model enhanced by a Generative Adversarial Network (GAN). The model is an end‐to‐end framework. First, the style‐based generator StyleGAN2 is used to generate a highly realistic and adjustable static narrator portrait. Then, a Bi‐directional Long Short‐Term Memory (Bi‐LSTM) network encodes the Mel‐frequency Cepstral Coefficients (MFCCs), phonemes, and prosodic features of the commentary audio, and the encoded features drive the parameters of a 3D Morphable Model (3DMM) to generate accurate facial action sequences. Finally, neural rendering with a temporal smoothing constraint fuses the generated dynamic portrait with the background video of the sports event to synthesize the final commentary video. The experimental results show that the video generated by this model reaches a Fréchet Inception Distance (FID) of 18.7 and a lip synchronization error of 5.1. In comparative evaluations against real‐world videos across different sports scenarios, an average of 50% of participants considered the generated videos to be “difficult to distinguish from those featuring real humans.” The ablation experiments further confirmed the effectiveness of key designs such as style mixing and the temporal smoothing loss. Removing the Style Mixing (SM) module increased the model's FID by 2.6. Eliminating the temporal smoothing loss increased both the lip‐sync error and the Landmark Distance (LMD) by 0.7. Removing the Adversarial Training (AT) module increased the model's FID by 6.9. In summary, the proposed model can generate virtual commentary content with high visual realism, accurate lip synchronization, and natural expression, which provides a feasible technical approach for the intelligent production of sports communication.
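
The abstract does not state the form of the temporal smoothing loss. As a minimal sketch, assuming it is a first‐order difference penalty on per‐frame facial parameters (a common choice, not confirmed by the authors), it could look like the following. The function name `temporal_smoothness_loss` and the plain list‐of‐lists input are hypothetical and chosen only for illustration.

```python
def temporal_smoothness_loss(frames):
    """Mean squared difference between consecutive per-frame parameter vectors.

    ASSUMPTION: a first-order difference penalty; the paper's actual loss
    is not given in the abstract. `frames` is a list of equal-length lists
    of floats, one list per video frame (e.g. 3DMM expression coefficients).
    """
    if len(frames) < 2:
        return 0.0  # a single frame has no frame-to-frame jitter

    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != len(curr):
            raise ValueError("all frames must have the same number of parameters")
        for a, b in zip(prev, curr):
            total += (b - a) ** 2
            count += 1
    return total / count if count else 0.0


# Hypothetical parameter sequences: one drifts gradually, one jitters.
smooth = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]
jittery = [[0.0, 0.0], [1.0, -1.0], [0.0, 0.0]]

# The jittery sequence is penalized more heavily than the smooth one.
print(temporal_smoothness_loss(smooth) < temporal_smoothness_loss(jittery))
```

Under this assumption, adding such a term to the training objective discourages abrupt frame‐to‐frame changes, which is consistent with the reported rise in lip‐sync error and LMD when the loss is removed.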
