Estimation of PM2.5 Concentration Based on PSO-Optimized Machine Learning Models and SHAP Analysis: A Case Study of Wuhan, Hubei Province
Qing Li, Junfu Fan
PM2.5 is a major air pollutant that threatens urban air quality and public health. Its concentration is influenced by both meteorological conditions and other air pollutants, and it exhibits complex nonlinear and temporal behavior. Traditional statistical methods struggle to capture such relationships among environmental variables, while machine learning models still leave room for improvement in hyperparameter optimization and interpretability. Developing an accurate and interpretable PM2.5 estimation model therefore remains an important research objective. This study used daily air-quality and meteorological data collected in Wuhan from 2016 to 2025 to develop six machine learning models: Decision Tree (DT), Random Forest (RF), XGBoost, LightGBM, Support Vector Machine (SVM), and Multilayer Perceptron (MLP). The hyperparameters of each model were tuned with the Particle Swarm Optimization (PSO) algorithm. The models were compared by root mean square error (RMSE), coefficient of determination (R²), and mean absolute error (MAE) on both the training and test sets. The Shapley Additive Explanations (SHAP) method was then applied to the best-performing model for both global and local interpretation. The results indicate that the PSO-MLP model achieved the best estimation performance among all evaluated models, with an R² of 0.746 on the test set. SHAP analysis revealed that CO, temperature (Temp), and NO2 were the most influential predictors, and that all variables exhibited distinct nonlinear relationships with PM2.5 concentration. These findings may contribute to PM2.5 concentration estimation, air-quality management, and environmental decision-making.
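As an illustration of the PSO-based hyperparameter search mentioned above, the following is a minimal sketch of a global-best particle swarm optimizer in plain Python. It is not the authors' implementation: the names `pso_minimize` and `toy_validation_error`, the inertia and acceleration coefficients (w = 0.7, c1 = c2 = 1.5), the swarm size, and the two hypothetical hyperparameters (a log learning rate and a hidden-layer width) are all assumptions chosen for illustration. The toy objective stands in for a validation error such as RMSE, which in practice would come from training and scoring a model.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `objective` over the box `bounds` with a basic global-best PSO.

    All default coefficients are illustrative assumptions, not values
    taken from the study.
    """
    rng = random.Random(seed)
    dim = len(bounds)

    # Random initial positions inside the box; zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers its own best position and score.
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]

    # The swarm shares a single global best.
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move, then clamp the position to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)

            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                # The global best is updated as soon as a particle improves on it.
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val

    return gbest, gbest_val


def toy_validation_error(params):
    """Hypothetical stand-in for a validation RMSE; minimum at (-3, 64)."""
    log_lr, hidden = params
    return (log_lr + 3.0) ** 2 + ((hidden - 64.0) / 32.0) ** 2


# Hypothetical search ranges: log10 learning rate and hidden-layer width.
search_bounds = [(-5.0, -1.0), (8.0, 256.0)]
best_params, best_err = pso_minimize(toy_validation_error, search_bounds)
print(best_params, best_err)
```

In a real setting, the objective would train the model with the candidate hyperparameters and return its validation error; integer-valued hyperparameters such as a layer width would need rounding before use.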