Dual-Direction Fine-Tuning of Pre-Trained Text-to-Image Models with Attentive Attribution Alignment

DOI: 10.3390/app16136441 ISSN: 2076-3417

Xinyu Wang, Junsheng Luan, Wei Xing, Huaizhong Lin, Lei Zhao

Recent large pre-trained text-to-image models carry extensive prior knowledge and can generate high-quality images from a simple prompt. However, when asked to generate a personalized object with a specific attribute description, namely an identity V*, such a model cannot generate images that match the desired identity. Existing personalization methods for text-to-image models suffer from three problems: (1) low identity fidelity, (2) low prompt fidelity, and (3) low visual quality. To address these issues, we propose a dual-direction fine-tuning method that guides the large pre-trained text-to-image model to generate images corresponding to the input prompt. Specifically, we input a class label into a large pre-trained text-to-image model to generate class-generalized images, then caption these images with a Bootstrapping Language–Image Pre-training (BLIP) image-captioning model to obtain class-generalized prompts. The two fine-tuning directions are the ID-strengthen and ID-weaken directions. For the ID-strengthen direction, we prepend an up-weighted V* to a class-generalized prompt to obtain an ID-strengthen prompt, then fine-tune the pre-trained model on the reference images of V* together with the ID-strengthen prompt. For the ID-weaken direction, we prepend a down-weighted V* to obtain an ID-weaken prompt, and fine-tune the pre-trained model. In addition, we propose an attentive attribution alignment strategy that aligns the semantic information of the weighted prompts with that of the class-generalized prompts. Qualitative and quantitative experiments show that our method improves identity fidelity and prompt fidelity while maintaining high visual quality.
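The prompt construction described above can be illustrated with a minimal sketch. It only shows how one class-generalized caption might be turned into an ID-strengthen and an ID-weaken prompt. The abstract does not specify how the weighting of V* is expressed, so the `(token:weight)` syntax, the weight values 1.5 and 0.5, and the names `PromptPair`, `weighted_token`, and `build_prompt_pair` are all hypothetical. The caption is hard-coded here, whereas in the described method it would come from the BLIP captioning model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    """The three prompts derived from one class-generalized caption."""
    class_generalized: str
    id_strengthen: str
    id_weaken: str


def weighted_token(token: str, weight: float) -> str:
    # ASSUMPTION: "(token:weight)" is an illustrative weighting syntax,
    # not the notation used by the authors.
    return f"({token}:{weight:.2f})"


def build_prompt_pair(
    caption: str,
    identity_token: str = "V*",
    up_weight: float = 1.5,    # ASSUMPTION: illustrative value
    down_weight: float = 0.5,  # ASSUMPTION: illustrative value
) -> PromptPair:
    """Place a weighted identity token at the beginning of a caption."""
    if not (up_weight > 1.0 > down_weight > 0.0):
        raise ValueError("expected up_weight > 1 > down_weight > 0")
    caption = caption.strip()
    return PromptPair(
        class_generalized=caption,
        id_strengthen=f"{weighted_token(identity_token, up_weight)} {caption}",
        id_weaken=f"{weighted_token(identity_token, down_weight)} {caption}",
    )


# Stand-in for a BLIP caption of a class-generalized image.
pair = build_prompt_pair("a dog sitting on a sofa")
print(pair.id_strengthen)
print(pair.id_weaken)
```

Under these assumptions, both weighted prompts share the same class-generalized caption as their suffix, which is what lets an alignment step compare each weighted prompt against its unweighted counterpart.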
