Machine Learning-Based Diabetes Risk Prediction via DiaHealth Dataset with Explainable AI and Streamlit Deployment
Samson Adeyemi, Muhammad Zahid Iqbal, Md Golam Muttaquee Talukder

The growing worldwide prevalence of Diabetes Mellitus highlights the urgent need for effective early detection methods to enable prompt intervention. This study develops a machine learning-based decision-support prototype for predicting diabetes risk using health metrics from the DiaHealth dataset, a recently published Bangladeshi open-source dataset for Type 2 diabetes prediction. Five supervised learning algorithms were evaluated: Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbour (KNN), Decision Tree (DT), and Random Forest (RF). Models were assessed across three stages: before feature scaling, after standardisation, and following hyperparameter optimisation via GridSearchCV, using accuracy, precision, recall, and F1-score as evaluation metrics. LR and SVM showed marked improvements after standardisation, consistent with their sensitivity to feature magnitude, whilst tree-based approaches such as DT and RF remained largely unchanged. KNN displayed minimal sensitivity to scaling, which is discussed in relation to the feature distributions of the dataset. Following hyperparameter tuning, RF achieved the highest accuracy of 95%, outperforming all other models. RF predictions were interpreted using Local Interpretable Model-agnostic Explanations (LIME) to promote transparency in model decision-making. The best-performing model was subsequently deployed as an interactive web-based prototype application using Streamlit, providing real-time prediction outputs. These findings demonstrate how preprocessing choices and hyperparameter tuning can differentially affect algorithm performance and illustrate the potential of combining explainable AI with practical deployment for diabetes risk assessment in a research context.
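
The three-stage comparison described above (no scaling, standardisation, then GridSearchCV tuning) can be illustrated with a minimal scikit-learn sketch. This is not the authors' code. It assumes synthetic stand-in data from `make_classification` rather than DiaHealth, uses only an SVM for brevity, and the parameter grid, split ratio, and random seeds are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic binary-classification data, NOT the DiaHealth dataset.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5, random_state=0)
# Stretch the columns to very different magnitudes so that scaling matters.
X = X * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def evaluate(model):
    """Fit on the training split and report the four metrics on the test split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }


# Stage 1: raw features, default hyperparameters.
unscaled = SVC()

# Stage 2: standardise inside a pipeline so the scaler is fitted on training data only.
scaled = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

# Stage 3: tune the scaled pipeline; this grid is a hypothetical example.
tuned = GridSearchCV(
    Pipeline([("scaler", StandardScaler()), ("clf", SVC())]),
    param_grid={"clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.01, 0.1]},
    cv=5,
    scoring="f1",
)

results = {
    "unscaled": evaluate(unscaled),
    "standardised": evaluate(scaled),
    "tuned": evaluate(tuned),
}

for stage, metrics in results.items():
    print(stage, {name: round(value, 3) for name, value in metrics.items()})
```

Placing the scaler inside the pipeline means each cross-validation fold in the tuning stage fits its own scaler, which avoids leaking test-fold statistics into training. Swapping `SVC` for `RandomForestClassifier` in the same structure would be one way to check the observation that tree-based models change little under standardisation.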