ShanghaiTech University Knowledge Management System
Tri-Modal Motion Retrieval by Learning a Joint Embedding Space | |
2024-03-01 | |
Proceedings title | ARXIV |
ISSN | 1063-6919 |
Pages | 1596-1605 |
Publication status | Published |
DOI | arXiv:2403.00691 |
Abstract | Information retrieval is an ever-evolving and crucial research domain. The substantial demand for high-quality human motion data especially in online acquirement has led to a surge in human motion research works. Prior works have mainly concentrated on dual-modality learning, such as text and motion tasks, but three-modality learning has been rarely explored. Intuitively, an extra introduced modality can enrich a model's application scenario, and more importantly, an adequate choice of the extra modality can also act as an intermediary and enhance the alignment between the other two disparate modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion alignment), a novel framework for three-modality learning integrating human-centric videos as an additional modality, thereby effectively bridging the gap between text and motion. Moreover, our approach leverages a specially designed attention mechanism to foster enhanced alignment and synergistic effects among text, video, and motion modalities. Empirically, our results on the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art performance in various motion-related cross-modal retrieval tasks, including text-to-motion, motion-to-text, video-to-motion and motion-to-video. |
Keywords | Contrastive Learning ; Human engineering ; Metadata ; Text processing ; Cross-modal ; Cross-modal retrieval ; Embeddings ; High quality ; Human motion data ; Human motions ; Motion alignment ; Motion retrieval ; Multi-modal learning ; Video motion |
Conference location | Seattle, WA, USA |
Conference dates | 16-22 June 2024 |
URL | View original |
Indexed by | EI |
Language | English |
WOS category | Computer Science, Artificial Intelligence ; Computer Science, Software Engineering |
WOS accession number | PPRN:88004002 |
Publisher | IEEE Computer Society |
EI accession number | 20250917941978 |
EI main heading | Information retrieval |
EI classification codes | 101.5 Ergonomics and Human Factors Engineering ; 1101.2 Machine Learning ; 1106.2 Data Handling and Data Processing ; 903.1 Information Sources and Analysis ; 903.3 Information Retrieval and Use |
Original document type | Conference article (CA) |
Source database | IEEE |
Document type | Conference paper |
Item identifier | https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/372891 |
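Collection | School of Information Science and Technology_Master's Students ; School of Creativity and Art_PI Research Group (P)_Tian Zheng Group |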
Corresponding author | Tian, Zheng |
Author affiliations | 1.ShanghaiTech Univ, Shanghai, Peoples R China 2.Chinese Acad Sci, Shenzhen Inst Adv Technol, Beijing, Peoples R China |
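First author affiliation | ShanghaiTech University |
Corresponding author affiliation | ShanghaiTech University |
First affiliation of first author | ShanghaiTech University |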
Recommended citation (GB/T 7714) | Yin, Kangning,Zou, Shihao,Ge, Yuxuan,et al. Tri-Modal Motion Retrieval by Learning a Joint Embedding Space[C]:IEEE Computer Society,2024:1596-1605. |