DOI: 10.18466/cbayarfbe.1766229 ISSN: 1305-130X

A Genetic Algorithm-Enhanced Method for Missing Value Imputation in Healthcare Datasets

Hatice Nizam Özoğur, Zeynep Orman
In healthcare datasets, imbalanced class distributions and missing data degrade the performance and stability of machine learning models, thereby hindering accurate analysis and disease diagnosis. Addressing these challenges is crucial for improving both the precision and reliability of healthcare data analysis. This paper proposes a novel preprocessing framework designed specifically for healthcare datasets to mitigate incomplete data and class imbalance. The framework introduces a new imputation method, GA-MICE, which enhances the Multiple Imputation by Chained Equations (MICE) technique with a Genetic Algorithm (GA) to impute missing values more accurately. The framework also incorporates the GASMOTEPSO_ENN method, which combines the Synthetic Minority Oversampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms with GA and Particle Swarm Optimization (PSO) heuristics to address class imbalance. After preprocessing, six machine learning classifiers are employed to categorize individuals as either patients or healthy subjects. Classifier performance is evaluated using multiple metrics: accuracy, precision, recall, F1-score, Matthews Correlation Coefficient (MCC), and Area Under the Curve (AUC). Experimental results demonstrate the effectiveness of the proposed approach in managing missing data and addressing class imbalance, achieving performance close to or exceeding that of existing methodologies reported in the literature.
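
As a reading aid for the evaluation metrics listed above, the following minimal sketch computes accuracy, precision, recall, F1-score, and MCC from a binary confusion matrix using their standard definitions. It is not the authors' evaluation code: the function name `binary_metrics` and the example counts are hypothetical, and AUC is omitted because it requires ranked classifier scores rather than counts.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives, with "patient" taken as the positive class (an assumption
    made for this illustration).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
m = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

By convention, this sketch reports a metric as 0.0 when its denominator is zero, for example MCC when every sample falls into a single class.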
