Crop-Masked Vegetation Indices and TerraClimate for District-Level Wheat Yield Prediction in Kazakhstan: SAVI Advantage, Climate Dominance, and Temporal Transferability Limits
Marua Alpysbay, Serik Nurakynov, Anvar Gapparov, Azamat Kaldybayev

Accurate district-level wheat yield forecasting is critical for Kazakhstan, the world’s seventh-largest wheat exporter. Prior remote-sensing studies typically compute vegetation indices over entire administrative units without isolating cropland, which dilutes the crop-specific signal and biases comparisons between remote-sensing and climate predictors. A 25-year (2000–2024) dataset was assembled for 149 Kazakh districts (n = 2378 district–year observations, ~390 features), integrating crop-masked Sentinel-2/Landsat-7 optical indices, Sentinel-1 SAR, TerraClimate, and station, soil, and terrain data. A HistGradientBoosting model was evaluated under both spatial (GroupKFold) and temporal (expanding-window) cross-validation. Ten-metre cropland masking substantially improved index–yield correlations, especially early in the season, and SAVI consistently outperformed NDVI from June onward. The best configuration, crop-masked optical indices with TerraClimate, achieved R² = 0.646 (RMSE = 0.349 t/ha) under spatial cross-validation, whereas adding SAR yielded no significant gain. Pre-season winter-climate data (January–March) reached about 91% of full-year accuracy, enabling forecasts months before sowing. Critically, temporal cross-validation produced a markedly lower mean R² = 0.413; the resulting predictability gap (ΔR² = 0.233) indicates that the temporal score is the more representative estimate of operational forecast accuracy. Residuals showed no significant spatial autocorrelation. These results indicate that cropland masking and joint reporting of spatial and temporal cross-validation are valuable for yield prediction in semi-arid continental environments.
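The contrast between the two validation schemes in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' pipeline: `group_kfold_splits` is a simplified round-robin stand-in for a grouped K-fold (the study names GroupKFold, which presumably refers to scikit-learn's implementation), and the fold count and the `min_train_years` value are assumptions, since the abstract states neither. The function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple


def group_kfold_splits(
    groups: Sequence[str], n_splits: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Spatial CV: every district appears in exactly one test fold.

    Simplified round-robin assignment of districts to folds; it does not
    balance fold sizes by sample count (assumption for illustration).
    """
    unique = sorted(set(groups))
    if n_splits < 2 or n_splits > len(unique):
        raise ValueError("n_splits must be between 2 and the number of groups")
    for k in range(n_splits):
        held_out = set(unique[k::n_splits])
        train = [i for i, g in enumerate(groups) if g not in held_out]
        test = [i for i, g in enumerate(groups) if g in held_out]
        yield train, test


def expanding_window_splits(
    years: Iterable[int], min_train_years: int = 10
) -> Iterator[Tuple[List[int], int]]:
    """Temporal CV: train on all years before the test year, then grow.

    min_train_years is a hypothetical choice, not a value from the study.
    """
    ordered = sorted(set(years))
    for i in range(min_train_years, len(ordered)):
        yield ordered[:i], ordered[i]


def predictability_gap(r2_spatial: float, r2_temporal: float) -> float:
    """Difference between the spatial and temporal cross-validated R^2."""
    return round(r2_spatial - r2_temporal, 3)


if __name__ == "__main__":
    # Toy district labels; the study itself covers 149 districts.
    districts = ["A", "A", "B", "C", "C", "D"]
    for train_idx, test_idx in group_kfold_splits(districts, n_splits=2):
        print("spatial fold:", train_idx, test_idx)

    for train_years, test_year in expanding_window_splits(range(2000, 2025)):
        print(f"train {train_years[0]}-{train_years[-1]} -> test {test_year}")

    # Scores reported in the abstract.
    print("gap:", predictability_gap(0.646, 0.413))
```

Under the spatial scheme a model is tested on unseen districts but has seen the test years during training, whereas the expanding window tests on a year that lies strictly after all training data, which is closer to how an operational forecast would be issued.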