Wide + Tiles Vision Transformer Framework for Smartphone-Based Grassland Biomass Prediction in Heterogeneous Field Conditions
Ranida Arystanova, Darkhan Zeinulla, Gulnara Kabzhanova, Anuarbek Bissembayev, Roza Bekseitova, Dani Sarsekova, Bakhbayeva Saule, Asset Arystanov, Janay Sagin, Margulan Nurtay

This study addresses the problem of accurate and rapid aboveground biomass estimation in rangeland ecosystems: traditional grazing methods are labor-intensive, while modern remote sensing techniques often require expensive equipment and controlled conditions. The goal of this work is to develop an efficient and accessible approach for estimating the biomass of natural pastures from ground-level RGB images captured with smartphones. To this end, we used a dataset of 1196 field images with corresponding biomass values collected from 40 districts in southern Kazakhstan and propose a wide + tiles architecture built on the DINOv3 Vision Transformer. The model uses attention pooling and feature fusion to integrate global and local features, and several preprocessing and augmentation strategies were compared. In the experiments, the best configuration reached R² = 0.733 and MAE ≈ 0.779 c/ha, and DINOv3 showed a clear advantage over ConvNeXtV2. Preprocessing strategies had minimal impact on accuracy, whereas high image resolution proved important. The results indicate that the proposed method performs consistently under heterogeneous field conditions and allows reliable biomass estimation without specialized equipment. This makes it a practical tool for monitoring pastures, planning forage supply, and supporting agronomic decision-making.
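To make the wide + tiles idea concrete, the following is a minimal sketch of how local tile features might be combined with a global feature. It is not the authors' implementation: the abstract does not specify the pooling or fusion details, so the single query vector, the concatenation-based fusion, and the function names `attention_pool` and `fuse` are all illustrative assumptions, and the short vectors stand in for embeddings that a backbone such as DINOv3 would produce.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tile_feats: List[Vector], query: Vector) -> Vector:
    """Pool tile features into one vector (hypothetical helper).

    Each tile is scored by its scaled dot product with a query vector
    (assumed to be learned in a real model); the softmax of the scores
    weights the average of the tile features.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, feat)) / math.sqrt(dim)
        for feat in tile_feats
    ]
    weights = softmax(scores)
    return [
        sum(w * feat[i] for w, feat in zip(weights, tile_feats))
        for i in range(dim)
    ]


def fuse(wide_feat: Vector, pooled_tiles: Vector) -> Vector:
    """Fuse global and local features by concatenation (an assumption)."""
    return wide_feat + pooled_tiles


# Toy stand-ins for backbone embeddings (assumed values, not real data).
wide = [0.5, 0.5]                # feature of the full "wide" image
tiles = [[1.0, 0.0], [3.0, 2.0]]  # features of two high-resolution tiles
query = [0.0, 0.0]               # a zero query gives uniform tile weights

pooled = attention_pool(tiles, query)
fused = fuse(wide, pooled)
print(pooled, fused)
```

In a trained model, the fused vector would be passed to a regression head that outputs biomass in c/ha; with a non-zero query, tiles whose features align with the query receive larger weights than the others.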