A Latent-Guided Framework for Text-Based Full-Body Human Motion Generation
Jannatul Nayeem, Hak-Bum Lee, Young-Ho Seo

Text-to-motion generation aims to synthesize realistic human motion sequences that accurately reflect natural language descriptions. While recent approaches have improved motion quality, achieving strong semantic alignment between text and motion, especially for fine-grained articulations, remains a significant challenge. In this work, we propose a latent-guided text-to-motion generation framework that strengthens the interaction between textual representations and motion latent sequences. The proposed method integrates a structured motion latent space with a text-conditioned variational generation module, enhanced by a cross-modal attention mechanism. This design enables the model to effectively capture both global motion dynamics and detailed semantic information from text. Extensive experiments on the Motion-X dataset demonstrate that the proposed approach achieves strong semantic alignment, as reflected by improved R-precision and competitive matching performance. In addition, the model improves multi-modality, indicating its ability to generate diverse motion patterns under the same textual condition. Qualitative results further show that the generated motions preserve core action semantics and exhibit coherent temporal dynamics across different motion categories. Overall, the proposed framework provides an effective solution for improving text–motion alignment in high-dimensional motion spaces, highlighting the importance of latent-guided modeling for realistic and semantically consistent motion generation.
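
For readers unfamiliar with the R-precision metric mentioned above, the following is a minimal sketch of how it is commonly computed in text-to-motion evaluation: each text embedding is compared against a pool of motion embeddings, and a retrieval counts as a hit when the paired motion ranks among the top-k nearest candidates. The abstract does not specify the evaluation protocol, so the function name `r_precision`, the use of Euclidean distance, and the assumption that text and motion embeddings come from a shared pretrained embedding space are illustrative choices, not the authors' implementation.

```python
import math


def r_precision(text_emb, motion_emb, top_k=3):
    """Fraction of texts whose paired motion ranks in the top_k nearest motions.

    Assumption: text_emb[i] and motion_emb[i] form a ground-truth pair, and
    both live in the same embedding space (hypothetical setup for this sketch).
    """
    if len(text_emb) != len(motion_emb):
        raise ValueError("text and motion embeddings must be paired")
    hits = 0
    for i, text_vec in enumerate(text_emb):
        # Euclidean distance from this text to every candidate motion.
        dists = [math.dist(text_vec, motion_vec) for motion_vec in motion_emb]
        # Rank of the paired motion = number of candidates strictly closer.
        rank = sum(d < dists[i] for d in dists)
        if rank < top_k:
            hits += 1
    return hits / len(text_emb)


# Toy example with three text/motion pairs in a 2-D embedding space.
texts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
motions = [[0.1, 0.0], [0.0, 0.9], [1.0, 0.1]]
top1 = r_precision(texts, motions, top_k=1)
top3 = r_precision(texts, motions, top_k=3)
```

In this toy pool only the first text retrieves its own motion as the nearest candidate, so top-1 precision is one third, while top-3 precision is trivially 1.0 because the pool holds only three motions. Published protocols typically use larger candidate pools and average over many random batches.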