MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models

doi:arXiv:2408.08464

	Weng, Fenghua; Xu, Yue; Fu, Chengyan; Wang, Wenjie
	2024-08-16
Proceedings title	2025 ASSOCIATION FOR THE ADVANCEMENT OF ARTIFICIAL INTELLIGENCE
Publication status	Published
DOI	arXiv:2408.08464
Abstract	As deep learning advances, Large Language Models (LLMs) and their multimodal counterparts, Vision-Language Models (VLMs), have shown exceptional performance in many real-world tasks. However, VLMs face significant security challenges, such as jailbreak attacks, where attackers attempt to bypass the model's safety alignment to elicit harmful responses. The threat of jailbreak attacks on VLMs arises from both the inherent vulnerabilities of LLMs and the multiple information channels that VLMs process. While various attacks and defenses have been proposed, there is a notable gap in unified and comprehensive evaluations, as each method is evaluated on different datasets and metrics, making it impossible to compare the effectiveness of each method. To address this gap, we introduce MMJ-Bench, a unified pipeline for evaluating jailbreak attacks and defense techniques for VLMs. Through extensive experiments, we assess the effectiveness of various attack methods against state-of-the-art (SoTA) VLMs and evaluate the impact of defense mechanisms on both defense effectiveness and model utility for normal tasks. Our comprehensive evaluation contributes to the field by offering a unified and systematic evaluation framework and the first publicly available benchmark for VLM jailbreak research. We also demonstrate several insightful findings that highlight directions for future studies.
Conference country	United States
Proceedings editor/Conference organizer	ASSOCIATION FOR THE ADVANCEMENT OF ARTIFICIAL INTELLIGENCE
Keywords	Vision-Language Model; Jailbreak Attack; Adversarial Detection
Conference name	39th AAAI Conference on Artificial Intelligence, AAAI 2025
Conference location	Philadelphia, Pennsylvania, USA
Conference dates	February 25 – March 4, 2025
Indexed by	EI
Language	English
WOS category	Computer Science, Information Systems
WOS accession number	PPRN:91463304
Document type	Conference paper
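Item identifier	https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/415564
Collection	School of Information Science and Technology_Master's Students; School of Information Science and Technology_PhD Students; School of Information Science and Technology_PI Research Groups_Wenjie Wang Group
Corresponding author	Wang, Wenjie
Author affiliation	ShanghaiTech Univ, Shanghai, Peoples R China
First author affiliation	ShanghaiTech University
Corresponding author affiliation	ShanghaiTech University
First affiliation of first author	ShanghaiTech University
Recommended citation (GB/T 7714)	Weng, Fenghua,Xu, Yue,Fu, Chengyan,et al. MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models[C]//ASSOCIATION FOR THE ADVANCEMENT OF ARTIFICIAL INTELLIGENCE,2024.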