DOI: 10.1145/3821640 ISSN: 1551-6857

Breaking GUI Agent Actions: Realistic Adversarial Attacks on Multimodal LLMs

Kanghua Mo, Zhengdao Li, Shanshan Huang, Li Hu

Recent progress in Multimodal Large Language Models (MLLMs) has enabled GUI agents that interpret interface screenshots and execute natural-language instructions. However, their increasing role in human–computer interaction raises security concerns. Existing work shows that MLLMs are highly sensitive to interface perturbations, yet most attacks assume unrealistic capabilities such as direct access to model-ready inputs. In real deployments, attackers cannot control the screenshot pipeline or the non-differentiable preprocessing operations preceding inference, rendering many prior attacks ineffective. We introduce RAA, a realistic adversarial attack framework that models the complete screenshot-to-prediction pipeline. To address gradient blockage from non-differentiable preprocessing (e.g., resizing, quantization), we develop a dual-branch differentiable preprocessing module that restores gradient flow while remaining faithful to the actual inference path. To improve robustness against changes in query phrasing, we further incorporate a task-level semantic variation mechanism that jointly optimizes over paraphrased task variants. Experiments on Qwen2.5-VL-3B/7B-Instruct and GUI-Owl-7B show that RAA consistently outperforms prior attacks, improving attack success rates by an average of 30% under localized perturbations and 10.7% under global perturbations, while maintaining strong effectiveness against standard defenses such as resizing, compression, and noise. Ablation studies highlight the roles of perturbation region and semantic diversity. Our framework establishes a practical threat model for GUI agents and offers a foundation for future robustness and security research.
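The gradient-blockage problem mentioned in the abstract can be illustrated with a small, purely hypothetical sketch. The abstract does not specify how the dual-branch module is built, so the code below is not the RAA implementation: it swaps in a generic straight-through-style surrogate, where the loss is computed on the actually quantized pixels (the faithful forward path) while the gradient is taken as if quantization were the identity. All function names, the squared-error loss, and the parameters `eps`, `step`, and `iters` are assumptions chosen for illustration.

```python
# Hypothetical sketch of a straight-through-style surrogate for 8-bit
# quantization. Illustrative only; not the method described in the paper.

def quantize(x):
    """Non-differentiable step: clamp to [0, 1], then round to 8-bit levels."""
    x = min(max(x, 0.0), 1.0)
    return round(x * 255) / 255

def loss_and_grad(pixels, target):
    """Squared-error loss on the quantized pixels (faithful forward path).

    The true derivative of quantize() is zero almost everywhere, so the
    gradient here treats quantization as the identity (surrogate branch).
    """
    q = [quantize(p) for p in pixels]
    loss = sum((qi - t) ** 2 for qi, t in zip(q, target))
    grad = [2 * (qi - t) for qi, t in zip(q, target)]
    return loss, grad

def attack(pixels, target, eps=0.1, step=0.02, iters=50):
    """Signed-gradient descent projected onto an L-infinity ball of radius eps."""
    orig = list(pixels)
    x = list(pixels)
    for _ in range(iters):
        _, grad = loss_and_grad(x, target)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            xi = x[i] - step * sign
            # Project back into the allowed perturbation range, then into [0, 1].
            xi = min(max(xi, orig[i] - eps), orig[i] + eps)
            x[i] = min(max(xi, 0.0), 1.0)
    return x

if __name__ == "__main__":
    pixels = [0.2, 0.5, 0.8]
    target = [0.25, 0.45, 0.9]
    before, _ = loss_and_grad(pixels, target)
    after, _ = loss_and_grad(attack(pixels, target), target)
    print(before, after)
```

Because the loss is always evaluated after quantization, any reduction it reports already reflects what the quantized input would produce, which is the sense in which such a surrogate stays faithful to the inference path.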
