DOI: 10.1145/3821640 ISSN: 1551-6857
Breaking GUI Agent Actions: Realistic Adversarial Attacks on Multimodal LLMs
Kanghua Mo, Zhengdao Li, Shanshan Huang, Li Hu
Recent progress in Multimodal Large Language Models (MLLMs) has enabled GUI agents that interpret interface screenshots and execute natural-language instructions. However, their increasing role in human–computer interaction raises security concerns. Existing work shows that MLLMs are highly sensitive to interface perturbations, yet most attacks assume unrealistic capabilities such as direct access to model-ready inputs. In real deployments, attackers cannot control the screenshot pipeline or the non-differentiable preprocessing operations preceding inference, rendering many prior attacks ineffective. We introduce RAA, a
realistic adversarial attack framework that models the complete screenshot-to-prediction pipeline. To address gradient blockage from non-differentiable preprocessing (e.g., resizing, quantization), we develop a