DOI: 10.1145/3821418 ISSN: 2157-6904
Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models
Abdulkadir Erol, Trilok Padhi, Agnik Saha, Mehmet Emin Aktas, Ugur Kursuncu
The rapid advancement of Large Vision-Language Models (LVLMs) has expanded their capabilities across applications ranging from content creation to productivity enhancement. Despite their innovative potential, LVLMs exhibit vulnerabilities, especially in generating potentially toxic or unsafe responses. Malicious actors can exploit these vulnerabilities to propagate toxic content using strategically crafted prompts, without fine-tuning or compute-intensive procedures. Despite ongoing red-teaming efforts to identify and mitigate these risks, the exploration of LVLM vulnerabilities remains nascent and has yet to be addressed systematically. This study systematically examines the vulnerabilities of open- and closed-weight LVLMs, including LLaVA, InstructBLIP, Fuyu, Qwen, DeepSeek, Gemini, GPT, and Grok, using adversarial prompting strategies informed by social theories to simulate real-world social manipulation tactics. Our findings show that (i) toxicity and insult are the most prevalent behaviors, with mean scores of 19.32% and 12.36%, respectively; (ii) Gemini-2.0-Flash, LLaVA-v1.6-Vicuna-13B, and Grok-2-Vision-1212 are the most vulnerable models, with toxic response rates reaching 46.93%, 23.81%, and 17.98% and insult response rates reaching 47.94%, 14.62%, and 12.27%, respectively; and (iii) prompting strategies incorporating dark humor and multimodal toxic prompt completion significantly elevate these vulnerabilities. Despite extensive safety alignment efforts, models still generate content with varying degrees of toxicity when prompted with adversarial inputs, highlighting the urgent need for enhanced safety mechanisms and robust guardrails in LVLM development.