VTR: Bidirectional Video-Textual Transmission Rail for CLIP-based Video Recognition
2024-07-19
Proceedings title: 2024 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME)
ISSN: 1945-7871
Publication status: Published
DOI: 10.1109/ICME57554.2024.10688220
Abstract

There are two key issues when transferring a vision-language model such as CLIP to video recognition: bidirectional video-textual transmission and temporal modeling. To address these issues, we propose a novel framework named Video-Textual Transmission Rail (VTR), which enables bidirectional transmission of visual-textual representations and temporal modeling concurrently. Specifically, Message Rail (MR) tokens are proposed within VTR to realize the bidirectional transfer of not only high-level visual-language knowledge but also low-level knowledge. For temporal modeling, we introduce the Temporal Elite (TE), a temporal modeling module within VTR that, thanks to the MR tokens, provides both short- and long-range temporal modeling under textual supervision. Extensive experiments on popular video datasets (i.e., Kinetics-400, Something-Something-v2, UCF-101 and HMDB-51) demonstrate that VTR achieves state-of-the-art performance in fully-supervised, zero-shot and few-shot video recognition.

Proceedings editor / conference sponsor: IEEE Circuits and Systems (CAS); IEEE Communications (ComSoc); IEEE Computer (CS); IEEE Signal Processing (SPS)
Keywords: Light transmission; Television transmission; Video analysis; Video recording; Bidirectional transmission; CLIP; Key Issues; Language model; Representation model; Temporal models; Textual representation; Transmission model; Video recognition; Vision-language model
Conference name: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Conference location: Niagara Falls, ON, Canada
Conference date: 15-19 July 2024
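Indexed by: EI
Language: English
Publisher: IEEE Computer Society
EI accession number: 20244317227837
EI main heading: Visual languages
EISSN: 1945-788X
EI classification codes: 1106.1.1; 1106.3.1; 716.4 Television Systems and Equipment; 741.1/Optics
Original document type: Conference article (CA)
Source database: IEEE
Document type: Conference paper
Item identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/430455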
Collection: School of Information Science and Technology_Distinguished Professor Group_Xiaolin Zhang Group
Author affiliations
1.Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai, China
2.University of Chinese Academy of Sciences, Beijing, China
3.ShanghaiTech University, Shanghai, China
Recommended citation
GB/T 7714
Shaoqi Yu,Lili Chen,Xiaolin Zhang,et al. VTR: Bidirectional Video-Textual Transmission Rail for CLIP-based Video Recognition[C]//IEEE Circuits and Systems (CAS), IEEE Communications (ComSoc), IEEE Computer (CS), IEEE Signal Processing (SPS):IEEE Computer Society,2024.