Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training
2025-02-17
Status: Published
Abstract: Safety alignment is critical in pre-training large language models (LLMs) to generate responses aligned with human values and refuse harmful queries. Unlike LLMs, the current safety alignment of vision language models (VLMs) is often achieved with post-hoc safety fine-tuning. However, these methods are less effective against white-box attacks. To address this, we propose Adversary-aware DPO (ADPO), a novel training framework that explicitly considers adversarial perturbations. ADPO integrates adversarial training into DPO to enhance the safety alignment of VLMs under worst-case adversarial perturbations. ADPO introduces two key components: (1) an adversarial-trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversarial-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. By combining these innovations, ADPO ensures that VLMs remain robust and reliable even in the presence of sophisticated jailbreak attacks. Extensive experiments demonstrate that ADPO outperforms baselines in both the safety alignment and the general utility of VLMs.
Language: English
DOI: arXiv:2502.11455
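Source: arXiv
Indexed in: PPRN.PPRN
WOS accession number: PPRN:121697506
WOS category: Computer Science, Information Systems
Document type: Preprint
Item identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/514092
Collection: School of Information Science and Technology_Master's Students
School of Information Science and Technology_PI Research Groups_Wang Wenjie Group
Corresponding author: Wang, Wenjie
Author affiliations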
1.ShanghaiTech Univ, Shanghai, Peoples R China
2.Sun Yat Sen Univ, Guangzhou, Peoples R China
3.Huazhong Univ Sci & Technol, Wuhan, Peoples R China
4.Tsinghua Univ, Beijing, Peoples R China
Recommended citation
GB/T 7714
Weng, Fenghua,Lou, Jian,Feng, Jun,et al. Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training. 2025.