DMV‐CLIP: Disentangled Multimodal Visual Adaptation for Text‐Driven Face Editing
Minghao Li, Fan Zhang, Xin Wei, Huan Wan, Haoruo Zhang, Xuhui Huang

ABSTRACT
Text‐driven face editing has attracted widespread interest due to its intuitive control and user‐friendly interaction. However, current state‐of‐the‐art (SOTA) methods face two main challenges: (1) they rely on general‐purpose image‐text encoders without fine‐tuning for modality fusion, so they struggle to capture the domain‐specific knowledge required for facial attribute editing (dozens of fine‐grained facial attributes such as moustache and lipstick); (2) they coarsely optimize all attributes simultaneously using a cross‐entropy loss, leading to severe mutual interference among attributes. To address these challenges, we propose Disentangled Multimodal Visual Adaptation for CLIP (DMV‐CLIP). First, DMV‐CLIP incorporates learnable context tokens to inject facial domain knowledge into the CLIP model via multimodal prompt learning (MPL). Second, it employs directional contrastive learning (DCL) to disentangle facial attributes and enable precise editing. Finally, DMV‐CLIP uses a vision‐language consistency model (VLCM) to maintain identity consistency while ensuring that the generated images strictly adhere to the semantic instructions.