DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
2025-02-17
Status: Published
Abstract: Large Language Models (LLMs) are widely applied in decision making, but their deployment is threatened by jailbreak attacks, in which adversarial users manipulate model behavior to bypass safety measures. Existing defense mechanisms, such as safety fine-tuning and model editing, either require extensive parameter modifications or lack precision, leading to performance degradation on general tasks and making them unsuitable for post-deployment safety alignment. To address these challenges, we propose DELMAN (Dynamic Editing for LLMs JAilbreak DefeNse), a novel approach that leverages direct model editing for precise, dynamic protection against jailbreak attacks. DELMAN directly updates a minimal set of relevant parameters to neutralize harmful behaviors while preserving the model's utility. To avoid triggering a safe response in benign contexts, we incorporate KL-divergence regularization so that the updated model remains consistent with the original model when processing benign queries. Experimental results demonstrate that DELMAN outperforms baseline methods in mitigating jailbreak attacks while preserving the model's utility, and that it adapts seamlessly to new attack instances, providing a practical and efficient solution for post-deployment model protection.
Language: English
DOI: arXiv:2502.11647
Source: arXiv
Indexed by: PPRN.PPRN
WOS accession number: PPRN:121699384
WOS categories: Computer Science, Artificial Intelligence; Computer Science, Information Systems
Document type: Preprint
Item identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/514094
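Collections: School of Information Science and Technology_Master's Students
School of Information Science and Technology_PI Research Groups_Wang Wenjie Group
Corresponding author: Wang, Wenjie
Author affiliations: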
1.ShanghaiTech Univ, Shanghai, Peoples R China
2.Zhejiang Univ, Hangzhou, Peoples R China
3.Tsinghua Univ, Beijing, Peoples R China
Recommended citation:
GB/T 7714
Wang, Yi,Weng, Fenghua,Yang, Sibei,et al. DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing. 2025.