Efficient Image-Only Inference for Multimodal Crop Disease Recognition via Modal Dropout and Adaptive Multi-Task Loss Learning
Jianlin Qiu, Depeng Gao, Shuxi Chen, Wenjie Liu

Crop leaf diseases cause 10–40% annual yield losses, yet timely field diagnosis remains difficult. Vision-language models (VLMs) improve recognition accuracy by supplying rich textual descriptions, but multimodal pipelines are too slow for real-time field use because they require text processing at inference. We present MTL-AWL, a framework built on a training–inference asymmetry: VLM text serves as privileged training-time supervision, and two coupled mechanisms, one that retains VLM semantics in the image encoder and one that exploits them, enable image-only deployment at close to multimodal accuracy. A modal-dropout strategy (p=0.6) intermittently masks the VLM text sequence during training, forcing the image encoder to retain cross-modal representations without relying on the text input. An adaptive multi-task loss jointly optimizes InfoNCE contrastive alignment, attention diversity, and modality consistency under learnable softmax weights. These weights consistently converge to a dominant contrastive term (55% on soybean, 68% on PlantDoc), which points to cross-modal alignment as the primary mechanism of VLM knowledge transfer. At inference, the image-only model reaches 818 FPS (3.7× faster than multimodal methods) at a cost of only 0.41 percentage points of accuracy on soybean, attaining 99.30%/98.89% (multimodal/image-only) on soybean and 72.65%/68.80% on PlantDoc. The resulting model is compact enough for real-time, offline field screening.
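
The two training mechanisms summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`softmax`, `adaptive_loss`, `modal_dropout`) are hypothetical, p=0.6 is assumed to be the probability of masking the text sequence, masking is assumed to mean replacing the sequence with zeros, and the three loss values are placeholder numbers rather than measured results. In a real training loop the weight logits would be learnable parameters updated by the optimizer.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_loss(losses, logits):
    """Combine task losses with softmax-normalized weights.

    `logits` stands in for the learnable weight parameters (assumption);
    the softmax keeps the weights positive and summing to 1.
    """
    weights = softmax(logits)
    total = sum(w * l for w, l in zip(weights, losses))
    return total, weights


def modal_dropout(text_tokens, p=0.6, rng=random):
    """With probability p, mask the whole text sequence.

    Masking is modeled here as zeroing every element (assumption).
    Returns the (possibly masked) sequence and a flag saying whether
    the text was dropped for this training step.
    """
    if rng.random() < p:
        return [0.0] * len(text_tokens), True
    return list(text_tokens), False


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder loss values for: contrastive, attention diversity, consistency.
    losses = [0.9, 0.3, 0.5]
    # Hypothetical weight logits; equal logits give equal weights.
    logits = [0.0, 0.0, 0.0]
    total, weights = adaptive_loss(losses, logits)
    text, dropped = modal_dropout([0.2, 0.7, 0.1], p=0.6, rng=rng)
    print(total, weights, text, dropped)
```

Because the weights pass through a softmax, raising one task's logit necessarily lowers the share of the others, which is what lets a single dominant term (the contrastive weight in the reported results) emerge during training.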