Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization
2024-09
Proceedings Title: The Thirty-Eighth Annual Conference on Neural Information Processing Systems
Publication Status: Accepted
Abstract

Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. Diffusion policies have been shown to significantly improve the performance of RL algorithms in continuous control tasks by overcoming the limitations of unimodal policies, such as Gaussian policies, and by providing the agent with enhanced exploration capabilities. However, existing work mainly focuses on applying diffusion policies in offline RL, while their incorporation into online RL is less investigated. The training objective of the diffusion model, known as the variational lower bound, cannot be optimized directly in online RL because 'good' actions are unavailable, which makes diffusion policy improvement difficult. To overcome this, we propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO). Specifically, we introduce the Q-weighted variational loss, which can be proven to be a tight lower bound of the policy objective in online RL under certain conditions. To fulfill these conditions, Q-weight transformation functions are introduced for general scenarios. Additionally, to further enhance the exploration capability of the diffusion policy, we design a special entropy regularization term. We also develop an efficient behavior policy that improves sample efficiency by reducing the variance of the diffusion policy during online interactions. Consequently, QVPO leverages the exploration capabilities and multimodality of diffusion policies, preventing the RL agent from converging to a sub-optimal policy. To verify the effectiveness of QVPO, we conduct comprehensive experiments on MuJoCo benchmarks. The results demonstrate that QVPO achieves state-of-the-art performance in both cumulative reward and sample efficiency.
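The Q-weighted variational loss described above lends itself to a compact sketch: a standard DDPM-style denoising loss over actions, weighted per sample by a nonnegative transform of the critic's Q-value so that high-value actions dominate the policy update. The sketch below is a minimal illustration under those assumptions only; all names (eps_model, q_net, q_transform, alphas_bar) are hypothetical, and the paper's exact transformation functions, entropy regularization term, and behavior policy are omitted.

    # Minimal sketch of a Q-weighted variational (denoising) loss, assuming a
    # DDPM-style epsilon-prediction diffusion policy. Hypothetical names; the
    # paper's exact formulation may differ.
    import torch

    def q_weighted_variational_loss(eps_model, q_net, states, actions,
                                    alphas_bar, q_transform):
        # Sample a random diffusion timestep per action and corrupt the action
        # with Gaussian noise, exactly as in standard DDPM training.
        B = actions.shape[0]
        T = alphas_bar.shape[0]
        t = torch.randint(0, T, (B,), device=actions.device)
        noise = torch.randn_like(actions)
        a_bar = alphas_bar[t].unsqueeze(-1)  # cumulative noise schedule at step t
        noisy_actions = a_bar.sqrt() * actions + (1.0 - a_bar).sqrt() * noise

        # Per-sample denoising (variational) loss: predict the injected noise.
        pred_noise = eps_model(noisy_actions, t, states)
        per_sample = ((pred_noise - noise) ** 2).sum(dim=-1)

        # Weight each sample by a nonnegative transform of its Q-value, so that
        # actions the critic scores highly contribute more to the update.
        with torch.no_grad():
            weights = q_transform(q_net(states, actions))  # must be >= 0

        return (weights * per_sample).mean()

    # One simple (hypothetical) nonnegative Q-weight transformation: shift
    # Q-values by their batch minimum so every weight is >= 0.
    def shifted_q_transform(q):
        return torch.clamp(q - q.min(), min=0.0)

The key design point, per the abstract, is that the weights must satisfy certain conditions (e.g., nonnegativity) for the weighted loss to remain a tight lower bound of the online policy objective; the Q-weight transformation functions exist precisely to enforce this in general scenarios.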

Proceedings Editor / Conference Organizer: Amir Globerson
Conference Name: The Thirty-Eighth Annual Conference on Neural Information Processing Systems
URL: View Original
Document Type: Conference Paper
Item Identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/452435
Collections: School of Information Science and Technology_PI Research Group_汪婧雅 Group
School of Information Science and Technology_PI Research Group_虞晶怡 Group
School of Information Science and Technology_Master's Students
School of Information Science and Technology_PhD Students
School of Information Science and Technology_PI Research Group_石野 Group
Corresponding Author: Shi Y (石野)
Author Affiliations:
1.ShanghaiTech University
2.Shanghai Jiao Tong University
3.MoE Key Laboratory of Intelligent Perception and Human Machine Collaboration
First Author's Affiliation: ShanghaiTech University
Corresponding Author's Affiliation: ShanghaiTech University
First Author's First Affiliation: ShanghaiTech University
Recommended Citation
GB/T 7714
Ding ST, Hu K, Zhang ZH, et al. Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization[C]//The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024.