DOI: 10.1145/3820888 ISSN: 1551-6857
Mobiflip: Information-Bottleneck-Guided Minimal Federated Adaptation for Cross-Modal Models
Zishan Xu, Jiansen Zhang, Wei Chen, Jueting Liu, Tingting Xu, Zehua Wang, Abdulmotaleb El Saddik
Cross-modal federated learning is constrained by bandwidth and on-device compute. We present Mobiflip, a minimalist strategy that freezes a lightweight backbone and communicates only a channel-wise \(1\times 1\) scaling adapter appended to the image branch. Guided by the Information Bottleneck, we prove that under common distributional and linear-encoder surrogates, per-channel scaling attains the linear optimum. Coupled with the directional geometry of (Mobile)CLIP, the adapter is, to first order, an optimal preconditioner of the cosine-similarity space: it preserves discriminative directions while compressing redundancy and suppressing inter-client drift. We adopt MobileCLIP as a mobile-friendly backbone to jointly minimize compute and communication. On CIFAR-10/100 and medical imaging, a single aggregation already yields stable balanced accuracy (Bacc); each round transmits only about 0.7% of the backbone parameters, with a \(>92\%\) reduction in communication. Compared with recent federated multimodal/large-model methods, Mobiflip maintains, or even improves, accuracy under ultra-low communication.
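As a rough illustration of the idea, the sketch below applies a per-channel scale to a frozen image embedding, classifies by cosine similarity against text embeddings, and averages only the scale vectors across clients. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify where the adapter attaches or how aggregation is weighted, so applying the scale to the final embedding and using a sample-size-weighted (FedAvg-style) average are assumptions, and all function names and toy values are hypothetical.

```python
import math


def apply_channel_scale(embedding, scale):
    """Multiply each channel by its own scalar (a diagonal, 1x1-style map)."""
    return [s * x for s, x in zip(scale, embedding)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict(image_embedding, text_embeddings, scale):
    """Return the index of the text embedding closest to the scaled image embedding."""
    scaled = apply_channel_scale(image_embedding, scale)
    sims = [cosine(scaled, t) for t in text_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)


def aggregate_scales(client_scales, client_sizes):
    """Assumed FedAvg-style aggregation: sample-size-weighted mean of scale vectors.

    Only these vectors are exchanged; the backbone stays frozen on every client.
    """
    total = sum(client_sizes)
    dim = len(client_scales[0])
    return [
        sum(n * s[c] for s, n in zip(client_scales, client_sizes)) / total
        for c in range(dim)
    ]


# Toy usage with made-up two-channel embeddings and two clients.
global_scale = aggregate_scales([[1.0, 2.0], [3.0, 4.0]], [1, 3])
label = predict([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [2.0, 0.5])
```

In this toy setting the payload per client is one scalar per embedding channel, which is why the communicated state can be a small fraction of the backbone; the exact fraction depends on the backbone and embedding width and is not computed here.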