Improving Entity Understanding for Vision-Language Pre-Training via Active Learning
Qunbo Wang, Sen Zhang, Boxuan Shao, Xize Guo, Jiayong An, Chao Fan, Yuanjun Jing, Junxian Li, Wenjun Wu

Although many researchers use pre-trained models to solve downstream tasks, more effective pre-training methods are still needed, especially for multi-modal pre-training, where high-quality training data is harder to obtain. This work aims to improve knowledge learning in multi-modal pre-training. Some researchers inject entity knowledge into pre-trained language models through masked entity model (MEM) training, which masks entities at random and has the model recover them. Because these methods do not consider which entities are more valuable to learn, they cannot guarantee good performance. Moreover, in multi-modal training data, some entities may be unrelated to the visual content. For vision-language pre-trained models, we propose a Masked Entity Model pre-training method based on Active learning (ActiveMEM). It actively masks entities that are both informative and uncertain for the model to recover, encouraging the model to extract more valuable knowledge from the data. We evaluate the proposed method with three pre-training datasets and four downstream datasets, and the experimental results demonstrate its effectiveness.
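
The contrast between random masking and active masking can be illustrated with a minimal sketch. The abstract does not specify how ActiveMEM scores entities, so everything below is an assumption made for illustration: uncertainty is approximated by the entropy of a hypothetical model distribution over candidate entities, informativeness by a hypothetical visual-relevance score in [0, 1], and the names `EntityCandidate`, `select_entities_to_mask`, and `mask_caption` are invented here.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EntityCandidate:
    text: str                 # entity mention as it appears in the caption
    probs: Sequence[float]    # hypothetical model distribution when this mention is masked
    visual_relevance: float   # hypothetical score in [0, 1]; 0 means unrelated to the image


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; used here as an assumed proxy for model uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_entities_to_mask(candidates: List[EntityCandidate], k: int) -> List[str]:
    """Pick the k entities with the highest (uncertainty x visual relevance) score.

    This scoring rule is an illustrative assumption, not the paper's criterion.
    """
    ranked = sorted(
        candidates,
        key=lambda c: entropy(c.probs) * c.visual_relevance,
        reverse=True,
    )
    return [c.text for c in ranked[:k]]


def mask_caption(caption: str, entities: List[str], mask_token: str = "[MASK]") -> str:
    """Replace each selected entity mention with a mask token."""
    for entity in entities:
        caption = caption.replace(entity, mask_token)
    return caption


# Toy example with made-up numbers.
caption = "The Eiffel Tower in Paris, photographed by Alice"
candidates = [
    EntityCandidate("Eiffel Tower", [0.4, 0.3, 0.3], 0.9),      # uncertain and visible
    EntityCandidate("Paris", [0.9, 0.05, 0.05], 0.5),           # model is already confident
    EntityCandidate("Alice", [0.25, 0.25, 0.25, 0.25], 0.0),    # uncertain but not in the image
]
selected = select_entities_to_mask(candidates, k=2)
masked = mask_caption(caption, selected)
```

In this toy case the photographer's name is never masked despite high uncertainty, because its assumed visual relevance is zero; a random MEM baseline would mask it as often as any other entity.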