DOI: 10.3390/fishes11070385 ISSN: 2410-3888

Research on Fish Recognition in Complex Backgrounds Using ViT-Enhanced YOLOv11

Xiangshuo He, Shenglong Yang, Wei Wang, Kai Zhu, Shengmao Zhang, Yang Dai, Keji Jiang, Fei Wang

Fish recognition under complex backgrounds is hampered by target overlap, occlusion, and chaotic spatial distribution. To address these challenges, an improved YOLOv11 recognition model based on the Vision Transformer (ViT) is proposed. Traditional Convolutional Neural Networks (CNNs) and the YOLO series models are limited by their local receptive fields, which makes it difficult to capture global semantic correlations when detecting dense and heavily occluded fish targets and often leads to feature confusion and false detections. Two improved models are constructed by embedding a ViT module either at the beginning of the Head (ViT-Head) or at the end of the Backbone (ViT-Backbone) of YOLOv11. In both, the self-attention mechanism of ViT captures global dependencies in the image and re-integrates and enhances the multi-scale features from the Backbone and Neck. Comparative experiments are conducted on the FishRecognition-2025 dataset, which contains 955 high-resolution RGB images covering nine common coastal fish species in four scene categories: single fish species, multiple classes separated, slight overlap of multiple fish species, and severe overlap of multiple fish species. Four models (original YOLOv11, traditional CNN, ViT-Head, and ViT-Backbone) are compared under identical training strategies and evaluation metrics. The results show that ViT-Backbone outperformed ViT-Head in terms of mAP50 and mAP50-95. Moreover, its overall accuracy across the four test categories surpassed that of YOLOv11, CNN, and ViT-Head. Although its accuracy in the single fish species and multiple classes separated scenarios was slightly lower than that of the CNN model, it demonstrated significant advantages in the slight-overlap and severe-overlap scenarios. These findings validate the effectiveness of the ViT module for global feature modeling and for adapting to complex backgrounds, and they suggest a promising technical direction for future real-time recognition in fishery field operations.
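As an illustration of what "capturing global dependencies" means for a convolutional feature map, the following is a minimal sketch of single-head self-attention in plain Python. It is not the authors' implementation. The function name `self_attention`, the identity query/key/value projections, and the omission of positional encoding, residual connections, and the MLP sub-layer are all simplifying assumptions made here for brevity. The point it shows is that every spatial position is mixed with every other position in one step, regardless of distance, which a small convolution kernel cannot do.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(feature_map):
    """Single-head self-attention over an H x W grid of C-dim vectors.

    Assumption (hypothetical simplification): queries, keys and values are
    the raw feature vectors themselves, i.e. no learned projections.
    Returns a new H x W grid with the same channel dimension.
    """
    h, w = len(feature_map), len(feature_map[0])
    # Flatten the spatial grid into a sequence of H*W tokens.
    tokens = [feature_map[i][j] for i in range(h) for j in range(w)]
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)

    out = []
    for q in tokens:
        # Scaled dot-product similarity between this token and ALL tokens.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of every token in the map.
        mixed = [sum(wt * v[c] for wt, v in zip(weights, tokens))
                 for c in range(d)]
        out.append(mixed)

    # Restore the H x W layout.
    return [out[i * w:(i + 1) * w] for i in range(h)]


if __name__ == "__main__":
    # Toy 2 x 2 feature map with 2 channels.
    fm = [[[1.0, 0.0], [0.0, 1.0]],
          [[1.0, 0.0], [0.0, 1.0]]]
    for row in self_attention(fm):
        print(row)
```

In this toy input, the two positions holding `[1.0, 0.0]` receive identical outputs even though they sit in different rows, because the attention weights depend on feature similarity rather than on spatial proximity. A real ViT block would add learned projections, multiple heads, positional information, and normalization on top of this core operation.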
