A Feature Map is Worth a Video Frame: Rethinking Convolutional Features for Visible-Infrared Person Re-identificationQiaolin He, Zhijie Zheng, Haifeng Hu
- Computer Networks and Communications
- Hardware and Architecture
Visible-Infrared Person Re-identification (VI-ReID) aims to search for the identity of the same person across different spectra. The feature maps obtained from the convolutional layers are generally used for loss calculation in the later stages of the model in VI-ReID, but their role in the early and middle stages of the model remains unexplored. In this paper, we propose a novel Rethinking Convolutional Features (ReCF) approach for VI-ReID. ReCF consists of two modules: Middle Feature Generation (MFG), which utilizes the feature maps in the early stage to reduce significant modality gap, and Temporal Feature Aggregation (TFA), which uses the feature maps in the middle stage to aggregate multi-level features for enlarging the receptive field. MFG generates middle modality features in the form of a learnable convolution layer as a bridge between RGB and IR modalities, which is more flexible than using fixed-parameter grayscale images and yields a better middle modality to further reduce the modality gap. TFA first treats the convolution process as a video sequence, and the feature map of each convolution layer can be considered a worthwhile video frame. Based on this, we can obtain a multi-level receptive field and a temporal refinement. In addition, we introduce a color-unrelated loss and a modality-unrelated loss to constrain the modality features for providing a common feature representation space. Experimental results on the challenging VI-ReID datasets demonstrate that our proposed method achieves state-of-the-art performance.