Synthetic Data Augmentation for Robust Classification of Diabetic vs. Non-Diabetic Blood FTIR Spectra
Ahmed Fadlelmoula, Kirill N. Boldyrev, Margarida Gonçalves, Helena Torres, Susana O. Catarino, Graça Minas, Vitor Carvalho

Early detection of diabetes mellitus (DM) is essential for preventing disease progression and improving clinical outcomes. However, developing robust machine learning (ML) models for diabetes diagnosis is often constrained by limited data availability, privacy regulations, and challenges with data sharing. This study investigates a privacy-preserving synthetic data augmentation framework for classifying diabetic and non-diabetic blood serum samples using Fourier Transform Infrared (FTIR) spectroscopy. Two deep generative approaches, Autoencoders (AEs) and Generative Adversarial Networks (GANs), were evaluated for their ability to generate realistic synthetic FTIR spectra while preserving the statistical and biochemical characteristics of the original dataset. Synthetic datasets generated by the AE and GAN models were assessed using six ML classifiers: Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), Gradient Boosting (GB), Logistic Regression (LoR), and Decision Tree (DT). Model performance was evaluated using accuracy, precision, recall, F1-score, Receiver Operating Characteristic (ROC) curves, and Area Under the Curve (AUC). Results showed that AE-generated spectra retained stronger discriminative characteristics and were more easily distinguished from the original spectra, whereas GAN-generated spectra exhibited lower classifier separability, suggesting closer alignment with the original data distribution and greater realism for privacy-oriented data augmentation. Correlation analysis demonstrated high spectral fidelity for both approaches. Compared with the original spectra, AE-generated spectra achieved r = 0.9990 and R² = 0.9999, whereas GAN-generated spectra achieved r = 0.9982 and R² = 0.9965.
The most prominent diabetes-related spectral variations were observed in the carbohydrate (1000–1200 cm⁻¹), Amide I (~1650 cm⁻¹), and lipid-associated (3000–3500 cm⁻¹) regions. To explore the transferability of the proposed framework, a preliminary experimental feasibility study was conducted using independently acquired whole-blood FTIR spectra. The generated spectra showed strong agreement with the measured whole-blood spectra, demonstrating the potential applicability of the framework under alternative sampling conditions. Because the experimental cohort included only one diabetic volunteer, this analysis was intended solely as a proof-of-concept assessment of spectral feasibility and methodological transferability, rather than as a validation of diabetes classification performance. Overall, the findings demonstrate that synthetic data generation can effectively augment limited FTIR datasets while preserving privacy and key spectral characteristics. The proposed framework provides a promising foundation for privacy-aware biomedical data augmentation and for the future development of robust FTIR-based diabetes screening systems. The results should be interpreted as methodological evidence of feasibility and synthetic data utility rather than as evidence of clinical diagnostic readiness, as the serum dataset remains modest in size and the independent whole-blood experiment was intentionally exploratory.
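
The abstract reports spectral fidelity as a Pearson correlation r and an R² value but does not state how either was computed. The following is a minimal sketch of one common way to obtain both for a pair of spectra, assuming that R² denotes the coefficient of determination of the generated spectrum taken as a direct prediction of the reference spectrum. The toy three-band spectrum, the noise level, and the function names `pearson_r` and `r_squared` are illustrative assumptions and are not taken from the study.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def r_squared(reference, generated):
    """Coefficient of determination, treating `generated` as a prediction of `reference`."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((a - b) ** 2 for a, b in zip(reference, generated))
    ss_tot = sum((a - mean_ref) ** 2 for a in reference)
    return 1.0 - ss_res / ss_tot


# Hypothetical wavenumber axis (cm^-1) and toy absorbance bands (center, width, height).
# The band centers loosely follow the regions named in the abstract; the shapes are invented.
wavenumbers = list(range(900, 3601, 4))
bands = [(1100.0, 40.0, 1.0), (1650.0, 30.0, 0.8), (3300.0, 120.0, 0.6)]
reference = [
    sum(h * math.exp(-0.5 * ((w - c) / s) ** 2) for c, s, h in bands)
    for w in wavenumbers
]

# Stand-in for a generated spectrum: the reference plus small Gaussian noise (assumed level).
rng = random.Random(0)
generated = [v + rng.gauss(0.0, 0.005) for v in reference]

r = pearson_r(reference, generated)
r2 = r_squared(reference, generated)
print(f"r = {r:.4f}, R^2 = {r2:.4f}")
```

Under this definition R² penalizes offsets and scale differences that r ignores, so the two numbers answer slightly different questions about agreement between spectra. Other definitions of R² exist, and the one used in the study may differ.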