Self-learning Canonical Space for Multi-view 3D Human Pose Estimation

doi:arXiv:2403.12440

	Li, Xiaoben 1,2; Meng, Mancheng 2; Wu, Ziyan 2; Chen, Terrence 2; Yang, Fan 2; Shen, Dinggang 1,2
	2024-03-29
Status	Published
Abstract	Multi-view 3D human pose estimation is naturally superior to single-view estimation because images from multiple views provide more comprehensive information, including camera poses, 2D/3D human poses, and 3D geometry. However, accurate annotations of this information are hard to obtain, making it challenging to predict accurate 3D human poses from multi-view images. To address this issue, we propose a fully self-supervised framework, named cascaded multi-view aggregating network (CMANet), which constructs a canonical parameter space to holistically integrate and exploit multi-view information. In our framework, the multi-view information is grouped into two categories: 1) intra-view information (i.e., camera pose, projected 2D human pose, and view-dependent 3D human pose), and 2) inter-view information (i.e., cross-view complement and 3D geometry constraint). Accordingly, CMANet consists of two components: an intra-view module (IRV) and an inter-view module (IEV). IRV extracts the initial camera pose and 3D human pose of each view; IEV fuses complementary pose information and cross-view 3D geometry into a final 3D human pose. To facilitate the aggregation of intra- and inter-view information, we define a canonical parameter space, described by the per-view camera pose and the human pose and shape parameters (θ and β) of the SMPL model, and propose a two-stage learning procedure. In the first stage, IRV learns to estimate the camera pose and view-dependent 3D human pose, supervised by the confident outputs of an off-the-shelf 2D keypoint detector. In the second stage, IRV is frozen and IEV further refines the camera pose and optimizes the 3D human pose by implicitly encoding the cross-view complement and 3D geometry constraint, which is achieved by jointly fitting the predicted multi-view 2D keypoints. Comprehensive experiments demonstrate the effectiveness of the proposed framework, modules, and learning strategy, and extensive quantitative and qualitative analysis shows that CMANet is superior to state-of-the-art methods.
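The idea of jointly fitting multi-view 2D keypoints can be illustrated with a small sketch. This is not the CMANet implementation: it uses no networks and no SMPL model, and it assumes a weak-perspective camera (rotation, scale, 2D translation), which the abstract does not specify. All function and variable names are hypothetical. It only shows how one shared 3D pose, projected through per-view cameras, can be scored against confidence-weighted 2D detections.

```python
import math

def rot_y(angle):
    """3x3 rotation about the y-axis (radians), as nested lists."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def project(joint, camera):
    """Weak-perspective projection (assumed camera model):
    rotate into the camera frame, drop depth, then scale and shift."""
    R, scale, (tx, ty) = camera
    x = sum(R[0][k] * joint[k] for k in range(3))
    y = sum(R[1][k] * joint[k] for k in range(3))
    return (scale * x + tx, scale * y + ty)

def multiview_reprojection_loss(joints3d, cameras, keypoints2d, confidences):
    """Confidence-weighted squared 2D error, summed over joints
    and averaged over views."""
    total = 0.0
    for camera, kps, confs in zip(cameras, keypoints2d, confidences):
        for joint, (u, v), w in zip(joints3d, kps, confs):
            pu, pv = project(joint, camera)
            total += w * ((pu - u) ** 2 + (pv - v) ** 2)
    return total / len(cameras)

# Toy data: three joints of one shared 3D pose seen from three hypothetical views.
joints = [(0.0, 0.0, 0.0), (0.2, 0.5, 0.1), (-0.3, 0.9, 0.4)]
cameras = [(rot_y(a), 1.0, (0.0, 0.0)) for a in (0.0, math.pi / 2, math.pi)]
# Stand-in for 2D detector output: exact projections of the true pose.
keypoints = [[project(j, cam) for j in joints] for cam in cameras]
confidences = [[1.0] * len(joints) for _ in cameras]

loss_exact = multiview_reprojection_loss(joints, cameras, keypoints, confidences)
shifted = [(x + 0.1, y, z) for x, y, z in joints]  # perturbed pose hypothesis
loss_shifted = multiview_reprojection_loss(shifted, cameras, keypoints, confidences)
print(loss_exact, loss_shifted)
```

In this toy setup the loss vanishes for the true pose and grows when the pose is perturbed. The side view barely registers the shift along x, while the front and back views do, which is one simple way to see why views complement each other. Setting a joint's confidence to zero removes its contribution, a rough analogue of supervising only on confident detections.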
Keywords	Human Pose Estimation; Multi-view; Self-learning
DOI	arXiv:2403.12440
Source	arXiv
WOS Accession Number	PPRN:88240541
WOS Category	Computer Science, Software Engineering
Document Type	Preprint
Item Identifier	https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/372939
Collection	School of Information Science and Technology_Master's Students; School of Biomedical Engineering_PI Research Groups_Dinggang Shen Group
Corresponding Author	Yang, Fan
Author Affiliations	1. ShanghaiTech Univ, Shanghai, Peoples R China 2. United Imaging Intelligence, Shanghai, Peoples R China
Recommended Citation (GB/T 7714)	Li, Xiaoben, Meng, Mancheng, Wu, Ziyan, et al. Self-learning Canonical Space for Multi-view 3D Human Pose Estimation. 2024.