Bloom or Bluff? Benchmarking Vision–Language Models Against Classical Machine Learning for Harmful Algal Bloom Detection from Satellite Imagery
Harsh Deep Singh Narula
In recent years, there has been growing interest in applying vision–language models (VLMs) to quantitative remote sensing. This study evaluates whether three commercial VLMs (GPT-4o, GPT-5.5, and Claude Sonnet 4.6) can detect and classify the severity of harmful algal blooms (HABs) from Sentinel-2 satellite imagery of western Lake Erie, and compares them against three classical machine learning classifiers, namely Random Forest (RF), Support Vector Machine (SVM), and eXtreme Gradient Boosting (XGBoost), each trained on both a three-band red, green, blue (RGB) composite representation of the imagery and a 10-band multi-spectral reflectance representation. Forty bloom events identified from the National Oceanic and Atmospheric Administration (NOAA) Harmful Algal Bloom Operational Forecast System (HAB-OFS) severity assessments were assembled into the evaluation dataset, spanning seven bloom seasons (2019–2025). For binary bloom detection, the VLMs did not match the classical RGB classifiers: their F1 scores (0.69–0.75) fell below both the best RGB classifier (Random Forest, 0.76) and a trivial always-present baseline (F1 = 0.77), and their false positive rates on bloom-absent images were 73–93%, against 27–40% for the RGB classifiers. The VLMs reached high recall by labeling most scenes as bloom-positive, which makes them operationally unreliable in this configuration. For severity classification, the VLMs assigned 60–70% of their predictions to the “moderate” category regardless of actual conditions and identified at most one of the two severe blooms, whereas the classical classifiers tracked the ground-truth distribution and delivered two to nearly three times the exact-match accuracy (0.44–0.59 vs. 0.20–0.225). The strongest method across all metrics was the multi-spectral SVM (F1 = 0.833, false positive rate 27%, accuracy 0.795).
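The detection metrics quoted above (F1, recall, false positive rate, and the always-present baseline) can be illustrated with a minimal Python sketch. The function name `binary_metrics` and the ten labels below are hypothetical and chosen only for illustration; they are not the study's data or code.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and false positive rate for 0/1 labels (1 = bloom present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # False positive rate: share of bloom-absent scenes labeled as bloom.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical toy labels: 6 bloom-present scenes followed by 4 bloom-absent scenes.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# Trivial baseline that labels every scene as bloom-present.
always_present = [1] * len(y_true)

# Hypothetical over-eager model: misses one bloom, flags 3 of 4 absent scenes.
eager_model = [1, 1, 1, 1, 1, 0, 1, 1, 1, 0]

baseline = binary_metrics(y_true, always_present)
eager = binary_metrics(y_true, eager_model)

print(f"baseline: F1={baseline['f1']:.3f}, FPR={baseline['fpr']:.2f}")
print(f"eager:    F1={eager['f1']:.3f}, FPR={eager['fpr']:.2f}")
```

In this toy case the over-eager model scores a lower F1 than the always-present baseline despite high recall, which is the same pattern the abstract reports for the VLMs. The false positive rate exposes the behavior that F1 alone can hide.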
Switching the same SVM from RGB to multi-spectral features raised accuracy from 0.675 to 0.795. This 12-percentage-point gain quantifies the value of the spectral information in the red-edge and shortwave infrared bands, which multi-spectral sensors capture but standard VLM vision encoders cannot access. Feature-importance analysis showed that the multi-spectral classifiers ranked two chlorophyll-specific indices, the Normalized Difference Chlorophyll Index (NDCI) and the Floating Algae Index (FAI), among their top predictors. These are the same signatures used in established operational algorithms. The RGB classifiers, by contrast, relied on red-channel variability and green-dominant pixel fractions, because those indices cannot be computed from RGB inputs. Two compounding limitations therefore constrain off-the-shelf VLMs for aquatic remote sensing: the limited spectral information available through standard RGB channels, and a mismatch between the land-dominated training distributions of these models and aquatic optical conditions. Domain-specific classifiers operating on multi-spectral data remain the more suitable tools for continued development of HAB monitoring and water-quality retrieval.
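To make concrete why NDCI and FAI cannot be derived from an RGB composite, the sketch below computes both from surface reflectance using their standard published forms. The Sentinel-2 band assignments (B4 red, B5 red-edge, B8 near-infrared, B11 shortwave infrared) and the approximate centre wavelengths are assumptions for illustration; the abstract does not state which bands or implementation the study used, and the reflectance values are invented.

```python
import numpy as np

# Assumed approximate Sentinel-2 band centre wavelengths in nm (not from the source).
LAMBDA_RED = 665.0    # B4
LAMBDA_NIR = 842.0    # B8
LAMBDA_SWIR = 1610.0  # B11


def ndci(red_edge, red):
    """Normalized Difference Chlorophyll Index: (red_edge - red) / (red_edge + red)."""
    red_edge = np.asarray(red_edge, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = red_edge + red
    out = np.full(denom.shape, np.nan)  # NaN wherever the denominator is zero
    np.divide(red_edge - red, denom, out=out, where=denom != 0)
    return out


def fai(red, nir, swir):
    """Floating Algae Index: NIR reflectance minus a red-to-SWIR linear baseline."""
    red, nir, swir = (np.asarray(b, dtype=float) for b in (red, nir, swir))
    # Baseline reflectance interpolated linearly to the NIR wavelength.
    slope = (LAMBDA_NIR - LAMBDA_RED) / (LAMBDA_SWIR - LAMBDA_RED)
    nir_baseline = red + (swir - red) * slope
    return nir - nir_baseline


# Invented reflectances for two pixels: bloom-like, then clear-water-like.
red = np.array([0.03, 0.02])
red_edge = np.array([0.06, 0.02])
nir = np.array([0.05, 0.01])
swir = np.array([0.01, 0.005])

ndci_values = ndci(red_edge, red)
fai_values = fai(red, nir, swir)
print("NDCI:", ndci_values)
print("FAI: ", fai_values)
```

Both indices depend on bands outside the visible range (red-edge for NDCI, near-infrared and shortwave infrared for FAI), so a model that receives only an RGB rendering has no direct access to them.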