FLAME: Federated Learning and Aggregated Multi-Model Ensemble for Multi-Class Alzheimer’s Disease Stage Classification from Structured Clinical Data
Karim Gasmi, Lassaad Ben Ammar, Moez Krichen, Ahod Alghuried
Background/Objectives: Accurately identifying the stage of Alzheimer’s disease (AD) from clinical data is crucial for early diagnosis and appropriate therapy. The task remains difficult because cognitive profiles overlap across stages of disease progression. This study presents FLAME, a diagnostic system built on an enhanced federated learning architecture for privacy-preserving multi-institutional deployment. It provides a systematic evaluation of machine learning (ML) and deep learning (DL) models for classifying five stages: cognitively normal (CN), subjective memory complaints (SMC), early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI), and AD. Methods: Sixteen traditional ML models and eleven DL architectures, including FT-Transformer and NODE, were evaluated on a structured clinical dataset comprising 362 features. A hybrid ensemble was built at the probability level by combining the outputs of LightGBM and a five-layer deep neural network (DNN). The ensemble weight was optimised automatically using a Genetic Algorithm (GA) with Macro-F1 as the fitness criterion, and the optimum was stable across 30 independent runs (w★=0.5024±0.0001). A federated learning architecture was then established in which the DNN was trained across non-IID clients while LightGBM remained centralised. Four aggregation algorithms were examined: FedAvg, FedProx, FedNova, and SCAFFOLD. Results: Among the DL architectures, FT-Transformer achieved the highest standalone performance (accuracy = 0.7810, κ = 0.7081). The five-layer DNN was selected as the DL representative for the hybrid ensemble.
LightGBM achieved the best performance among the ML models (accuracy = 0.8156, κ = 0.7537), with identical results across 10 seeds. The difference between LightGBM and XGBoost is not statistically significant (McNemar p = 0.4227). The GA-optimised hybrid ensemble (w = 0.685) surpassed both individual baselines on all evaluation metrics. Among the federated configurations, the FedNova hybrid design performed best overall and exceeded every centralised configuration in accuracy (accuracy = 0.8213, κ = 0.7614). Conclusions: Evolutionary ensemble optimisation combined with federated learning provides a robust, scalable, and privacy-preserving approach to AD stage classification and a clinically viable framework for real-world multi-institutional decision-support systems. However, the AD class remains severely under-recalled in all configurations (F1 ≤ 0.21), which makes it the primary open challenge for clinical translation.
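
The probability-level ensemble and its Macro-F1-driven weight search can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a single scalar weight w multiplies the first model's class probabilities and (1 − w) the second's, and the GA operators (elitist selection, arithmetic crossover, Gaussian mutation), the population size, and all function names are hypothetical choices made only for illustration.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def blend(p_a, p_b, w):
    """Probability-level ensemble: w * p_a + (1 - w) * p_b."""
    return w * p_a + (1.0 - w) * p_b

def ga_search_weight(p_a, p_b, y_true, pop_size=20, generations=30, seed=0):
    """Toy GA over one weight in [0, 1], using Macro-F1 as fitness.

    p_a, p_b: (n_samples, n_classes) class-probability arrays, e.g. from a
    tree model and a neural network (assumed inputs, not the paper's data).
    """
    rng = np.random.default_rng(seed)
    n_classes = p_a.shape[1]

    def fitness(w):
        return macro_f1(y_true, blend(p_a, p_b, w).argmax(axis=1), n_classes)

    pop = rng.uniform(0.0, 1.0, size=pop_size)
    pop[0], pop[1] = 0.0, 1.0  # seed with both single-model solutions
    for _ in range(generations):
        fit = np.array([fitness(w) for w in pop])
        order = np.argsort(fit)[::-1]
        elite = pop[order[: pop_size // 2]]  # selection: keep the top half
        parents = rng.choice(elite, size=(pop_size - elite.size, 2))
        children = parents.mean(axis=1)  # arithmetic crossover
        children += rng.normal(0.0, 0.05, size=children.size)  # mutation
        pop = np.clip(np.concatenate([elite, children]), 0.0, 1.0)

    fit = np.array([fitness(w) for w in pop])
    best = int(np.argmax(fit))
    return float(pop[best]), float(fit[best])
```

Because the initial population contains w = 0 and w = 1 and the top half always survives, the returned Macro-F1 on the search data is never below that of either base model alone. In practice the weight would be tuned on a validation split and reported on held-out test data.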