Vision Transformer with Spatial 2D Multi-Channel Tokens
Sirui Zheng, Yu Li, Zhongxiang Zhang, Dequn Zhao

Vision Transformer (ViT) has been widely adopted in the computer vision community. However, the standard ViT often has a large number of parameters, usually performs poorly when trained from scratch on medium-scale datasets, and does not explicitly preserve the local spatial and channel-wise structure within each token. This work proposes a novel model called the Token-Shared Convolutional Projection Vision Transformer (TSCP-ViT). The core idea of TSCP-ViT is to integrate convolutional layers into the multi-head attention mechanism and to apply the same convolutional operation independently to each token, where each token is itself a spatial 2D multi-channel feature map. In addition, this work introduces a Transformer decoder immediately after each Transformer encoder, enabling the classification tokens to aggregate information from all tokens and be updated using statistical information. A trainable Non-Reversing Gate GELU (NRG-GELU) activation is also proposed. Comparative experiments on CIFAR-100, Food-101, and ImageNet100 show that, under comparable parameter counts and without pretraining or knowledge distillation, TSCP-ViT substantially surpasses ViT, outperforms CvT, outperforms ResNet on Food-101, and approaches ResNet on CIFAR-100 and ImageNet100, at the cost of considerably higher FLOPs.
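
The token-shared projection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `conv2d_same` and `token_shared_projection`, the nested-list layout `[channel][row][col]`, and the choice of stride 1 with zero padding are all assumptions made for illustration, and the integration into multi-head attention is not shown. The sketch only demonstrates the one property stated in the abstract: a single set of convolution weights is applied to every token independently, so each token keeps its 2D multi-channel shape.

```python
# Hypothetical sketch (not from the paper): one convolution shared by all tokens.
# A token is a nested list indexed as token[channel][row][col].


def conv2d_same(token, weight, bias):
    """Stride-1, zero-padded 2D convolution on a single token.

    As in most deep learning libraries, this is a cross-correlation
    (the kernel is not flipped).

    token:  [C_in][H][W]
    weight: [C_out][C_in][K][K], with K odd
    bias:   [C_out]
    returns [C_out][H][W]
    """
    c_in = len(token)
    height, width = len(token[0]), len(token[0][0])
    k = len(weight[0][0])
    pad = k // 2
    out = []
    for o, kernel in enumerate(weight):
        # Start each output plane from the bias of its output channel.
        plane = [[bias[o]] * width for _ in range(height)]
        for c in range(c_in):
            for y in range(height):
                for x in range(width):
                    acc = 0.0
                    for dy in range(k):
                        for dx in range(k):
                            yy, xx = y + dy - pad, x + dx - pad
                            # Positions outside the token count as zero.
                            if 0 <= yy < height and 0 <= xx < width:
                                acc += kernel[c][dy][dx] * token[c][yy][xx]
                    plane[y][x] += acc
        out.append(plane)
    return out


def token_shared_projection(tokens, weight, bias):
    """Apply the same convolution weights to every token independently.

    tokens: list of N tokens, each [C_in][H][W]
    returns list of N tokens, each [C_out][H][W]
    """
    return [conv2d_same(token, weight, bias) for token in tokens]
```

Because no information crosses token boundaries inside `token_shared_projection`, reordering the input tokens simply reorders the outputs. Under these assumptions, mixing between tokens would be left entirely to the attention step. In a tensor library the same effect is usually obtained by folding the token axis into the batch axis before calling a standard 2D convolution layer.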