DOI: 10.1142/s0219467824500608 ISSN: 0219-4678
CMVT: ConVit Transformer Network Recombined with Convolutional Layer
Chunxia Mao, Jun Li, Tao Hu, Xuanyu Zhao
Vision transformers are deep neural networks that apply a self-attention mechanism to image classification and can process data in parallel. To address the structural loss of vision transformers, this paper combines ConViT with a convolutional neural network (CNN) and proposes a new model, Convolution Meet Vision Transformers (CMVT). The model adds a convolution module to the ConViT network to address the structural loss of the transformer. Adding a hierarchical data representation improves the model's ability to extract image classification features progressively. Comparative experiments on multiple datasets show improved efficiency and performance on all of them.