Communication Efficient Distributed Bayesian Cluster Learning
Yilun Huang, Sounak Chakraborty
ABSTRACT
This paper introduces two novel, communication‐efficient Bayesian frameworks, FLamb and OSCLamb, for clustering high‐dimensional data in federated learning (FL) settings. Traditional clustering methods often struggle with the non‐IID data and privacy constraints inherent in distributed environments. Our proposed methods extend the Latent Mixture for Bayesian (Lamb) model to address these challenges, enabling robust dimension reduction and variable selection without sharing raw data. FLamb is an iterative FL algorithm where a central server aggregates sufficient statistics from distributed sites to build a global consensus model. While generally more accurate, its performance can be sensitive to the number of participating sites and communication overhead. In contrast, OSCLamb is a communication‐efficient, single‐round decentralized framework that uses peer‐to‐peer consensus averaging, significantly reducing latency and proving more robust in settings with extreme data heterogeneity. Our simulation studies demonstrate the trade‐offs between the methods, with FLamb achieving higher accuracy in less heterogeneous environments and OSCLamb offering superior speed and stability under challenging conditions. We validate our approaches on two real‐world high‐dimensional datasets, a single‐cell RNA sequencing dataset and an EEG dataset, where both methods demonstrate compelling clustering performance. A key advantage of our Bayesian approach, particularly FLamb, is the ability to provide comprehensive posterior uncertainty quantification for the cluster structure, offering more interpretable and reliable results in decentralized analyses.
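The idea of aggregating sufficient statistics instead of raw data can be illustrated with a toy sketch. This is not the FLamb algorithm, which fits a Bayesian latent mixture model; it assumes a one-dimensional Gaussian and shows only that summing per-site statistics at a server reproduces the pooled estimates. The names `SiteStats`, `local_stats`, and `aggregate` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class SiteStats:
    """Sufficient statistics of a 1-D Gaussian for one site (hypothetical container)."""
    n: int      # number of observations at the site
    s1: float   # sum of observations
    s2: float   # sum of squared observations


def local_stats(x: Iterable[float]) -> SiteStats:
    """Runs at a site; only these three numbers leave the site."""
    xs = list(x)
    return SiteStats(n=len(xs), s1=sum(xs), s2=sum(v * v for v in xs))


def aggregate(stats: List[SiteStats]) -> Tuple[float, float]:
    """Runs at the server: add the statistics, then recover the global mean and variance."""
    n = sum(s.n for s in stats)
    s1 = sum(s.s1 for s in stats)
    s2 = sum(s.s2 for s in stats)
    mean = s1 / n
    var = s2 / n - mean * mean  # population (maximum-likelihood) variance
    return mean, var


# Three sites with unequal sample sizes and different local distributions (toy data).
sites = [[1.0, 2.0, 3.0], [10.0, 12.0], [5.0]]
global_mean, global_var = aggregate([local_stats(x) for x in sites])
```

Because the statistics are additive, the server's estimates match those computed on the pooled data, even though the sites differ in size and distribution. A full model would exchange richer statistics and iterate over multiple rounds.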