CG‐VTON: Controllable Generation of Virtual Try‐On Images Based on Multimodal Conditions
Haopeng Lei, Xuan Zhao, Yaqin Liang, Yuanlong Cao
ABSTRACT
Transforming fashion design sketches into realistic garments remains a challenging task due to the reliance on labor‐intensive manual workflows that limit efficiency and scalability in traditional fashion pipelines. While recent advances in image generation and virtual try‐on technologies have introduced partial automation, existing methods still lack controllability and struggle to maintain semantic consistency in garment pose and structure, restricting their applicability in real‐world design scenarios. In this work, we present CG‐VTON, a controllable virtual try‐on framework designed to generate high‐quality try‐on images directly from clothing design sketches. The model integrates multimodal conditional inputs, including dense human pose maps and textual garment descriptions, to guide the generation process. A novel pose constraint module is introduced to enhance garment‐body alignment, while a structured diffusion‐based pipeline performs progressive generation through latent denoising and global‐context refinement. Extensive experiments conducted on benchmark datasets demonstrate that CG‐VTON significantly outperforms existing state‐of‐the‐art methods in terms of visual quality, pose consistency, and computational efficiency. By enabling high‐fidelity and controllable try‐on results from abstract sketches, CG‐VTON offers a practical and robust solution for bridging the gap between conceptual design and realistic garment visualization.