ShanghaiTech University Knowledge Management System
VTR: Bidirectional Video-Textual Transmission Rail for CLIP-based Video Recognition | |
2024-07-19 | |
Proceedings title | 2024 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME)
ISSN | 1945-7871 |
Publication status | Published |
DOI | 10.1109/ICME57554.2024.10688220 |
Abstract | There are two key issues when transferring a vision-language model such as CLIP to video recognition: bidirectional video-textual transmission and temporal modeling. To address these issues, we propose a novel framework named Video-Textual Transmission Rail (VTR), which enables bidirectional transmission of visual-textual representations and temporal modeling concurrently. Specifically, Message Rail (MR) tokens are proposed within VTR to realize the bidirectional transfer of not only high-level but also low-level visual-language knowledge. For temporal modeling, we introduce the Temporal Elite (TE), a temporal modeling module within VTR that, thanks to MR tokens, provides both short- and long-range temporal modeling under textual supervision. Extensive experiments on popular video datasets (i.e., Kinetics-400, Something-Something-v2, UCF-101 and HMDB-51) demonstrate that VTR achieves state-of-the-art performance in fully-supervised, zero-shot and few-shot video recognition. |
Proceedings editor / Conference organizer | IEEE Circuits and Systems (CAS) ; IEEE Communications (ComSoc) ; IEEE Computer (CS) ; IEEE Signal Processing (SPS) |
Keywords | Light transmission ; Television transmission ; Video analysis ; Video recording ; Bidirectional transmission ; CLIP ; Key Issues ; Language model ; Representation model ; Temporal models ; Textual representation ; Transmission model ; Video recognition ; Vision-language model |
Conference name | 2024 IEEE International Conference on Multimedia and Expo, ICME 2024 |
Conference location | Niagara Falls, ON, Canada |
Conference date | 15-19 July 2024 |
Indexed by | EI |
Language | English |
Publisher | IEEE Computer Society |
EI accession number | 20244317227837 |
EI main heading | Visual languages |
EISSN | 1945-788X |
EI classification codes | 1106.1.1 ; 1106.3.1 ; 716.4 Television Systems and Equipment ; 741.1/Optics |
Original document type | Conference article (CA) |
Source database | IEEE |
Document type | Conference paper |
Item identifier | https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/430455 |
Collection | School of Information Science and Technology_Distinguished Professor Group_Xiaolin Zhang Group |
Author affiliations | 1.Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai, China 2.University of Chinese Academy of Sciences, Beijing, China 3.ShanghaiTech University, Shanghai, China |
Recommended citation (GB/T 7714) | Shaoqi Yu,Lili Chen,Xiaolin Zhang,et al. VTR: Bidirectional Video-Textual Transmission Rail for CLIP-based Video Recognition[C]//IEEE Circuits and Systems (CAS), IEEE Communications (ComSoc), IEEE Computer (CS), IEEE Signal Processing (SPS):IEEE Computer Society,2024. |