Robustness of Large Vision Language Model Features Under Wireless Channel Degradation for Medical Visual Question Answering
Merve Güllü, Necaattin Barışçı

Deploying medical visual question answering (VQA) systems over wireless networks introduces a fundamental challenge: channel-induced image degradation may corrupt the visual representations extracted by large vision-language models (VLMs), leading to unreliable diagnostic decisions. We investigate the robustness of frozen LLaVA-1.6, BLIP-2, and BioViL-T hidden-state features under additive white Gaussian noise (AWGN), Rayleigh fading, and six combined JPEG-compression-plus-channel conditions (quality factors q ∈ {20, 50, 70}) across signal-to-noise ratios (SNRs) from −5 to +20 dB. A lightweight MLP classifier is trained exclusively on clean features and evaluated on channel-degraded features, enabling controlled analysis of representation robustness without retraining. We introduce the Feature Robustness Score (FRS), defined as the difference between cosine similarity and normalized L2 drift of clean versus degraded features, together with a validation-set FRS threshold analysis as a label-free retraining criterion. A wavelet sub-band energy analysis further characterizes the spectral distribution of channel-induced feature drift. Experiments on PathVQA and VQA-RAD reveal four key findings: (1) LLaVA-1.6 features maintain cosine similarity above 0.98 across all eight channel conditions and all SNR levels, with statistically significant MLP gains at every tested point (p < 0.05, McNemar’s test); (2) BLIP-2 and BioViL-T features are less stable but still support consistent MLP improvements, with BioViL-T performing competitively on VQA-RAD, suggesting domain alignment matters; (3) JPEG compression quality (q = 20, 50, 70) has negligible impact on feature drift, establishing VLM features as JPEG quality-invariant; and (4) wavelet analysis confirms that channel noise primarily affects high-frequency detail bands while preserving low-frequency semantic content.
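
To make the FRS definition concrete, the following is a minimal sketch of one way it could be computed. The abstract states only that FRS is cosine similarity minus normalized L2 drift; the specific normalization used here (dividing the drift by the norm of the clean feature), the function name `feature_robustness_score`, and the `eps` stabilizer are assumptions made for illustration and may differ from the authors' implementation.

```python
import numpy as np


def feature_robustness_score(clean, degraded, eps=1e-12):
    """Cosine similarity minus normalized L2 drift, computed per feature vector.

    ASSUMPTION: the drift is normalized by the L2 norm of the clean feature.
    The abstract does not specify the normalization, so treat this as one
    plausible reading rather than the paper's exact formula.
    """
    clean = np.asarray(clean, dtype=np.float64)
    degraded = np.asarray(degraded, dtype=np.float64)

    clean_norm = np.linalg.norm(clean, axis=-1)
    degraded_norm = np.linalg.norm(degraded, axis=-1)

    # Cosine similarity between each clean/degraded pair.
    cosine = np.sum(clean * degraded, axis=-1) / (clean_norm * degraded_norm + eps)

    # L2 distance between the pair, relative to the clean feature's norm.
    drift = np.linalg.norm(clean - degraded, axis=-1) / (clean_norm + eps)

    return cosine - drift


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for hidden-state features (not real VLM outputs).
    clean_feats = rng.normal(size=(4, 768))
    noisy_feats = clean_feats + 0.05 * rng.normal(size=(4, 768))
    print(feature_robustness_score(clean_feats, noisy_feats))
```

Under this reading, identical features score close to 1 and the score falls as the degraded feature rotates away from or moves farther from the clean one. A label-free retraining check would then compare the mean score on a validation set against a chosen threshold.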