HaViT: Hybrid-Attention Based Vision Transformer for Video Classification
Year: 2023
Proceedings: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ISSN: 0302-9743
Volume: 13844 LNCS
Pages: 502-517
Publication Status: Published
DOI: 10.1007/978-3-031-26316-3_30
Abstract

Video transformers have become a promising tool for video classification due to their success in modeling long-range interactions through the self-attention operation. However, existing transformer models exploit only the patch dependencies within a video when computing self-attention, ignoring the patch dependencies across different videos. This paper argues that external patch prior information is beneficial to the performance of video transformer models for video classification. Motivated by this assumption, this paper proposes a novel Hybrid-attention based Vision Transformer (HaViT) model for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. Unlike existing self-attention, the hybrid-attention is computed from internal patch tokens together with an external patch token dictionary that encodes patch prior information across different videos. Experiments on Kinetics-400, Kinetics-600, and Something-Something v2 show that the HaViT model achieves state-of-the-art performance on the video classification task against existing methods. Moreover, experiments show that the proposed hybrid-attention scheme can be integrated into existing video transformer models to improve performance. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
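The abstract describes attention computed over a video's internal patch tokens together with an external patch-token dictionary shared across videos. The paper's exact formulation is not reproduced in this record, so the following is only a minimal single-head NumPy sketch of that idea; the function name `hybrid_attention`, the dictionary `D`, and the projection matrices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_attention(X, D, Wq, Wk, Wv):
    """Single-head hybrid attention (sketch, not the paper's code).

    X: (n, d) internal patch tokens from one video.
    D: (m, d) external patch-token dictionary shared across videos.

    Queries come from the internal tokens only; keys and values come
    from the internal tokens concatenated with the external dictionary,
    so each patch attends both within the video and to external priors.
    """
    KV = np.concatenate([X, D], axis=0)       # (n + m, d)
    Q, K, V = X @ Wq, KV @ Wk, KV @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n + m)
    return softmax(scores) @ V                # (n, d)

# Toy usage with random tokens and projections.
rng = np.random.default_rng(0)
n, m, d = 4, 8, 16
X, D = rng.normal(size=(n, d)), rng.normal(size=(m, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = hybrid_attention(X, D, Wq, Wk, Wv)
print(out.shape)  # (4, 16)
```

With `m = 0` (an empty dictionary) this reduces to ordinary self-attention over the video's own patches, which matches the abstract's framing of hybrid-attention as a generalization that adds external patch priors.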

Keywords: Computer vision; Classification tasks; Long range interactions; Longer-range interaction; Performance; Prior information; State-of-the-art performance; Transformer modeling; Video classification
Conference: 16th Asian Conference on Computer Vision, ACCV 2022
Place of Publication: Gewerbestrasse 11, Cham, CH-6330, Switzerland
Conference Location: Macao, China
Conference Dates: December 4–8, 2022
Indexed In: EI; CPCI-S
Language: English
Funding Project: NSFC[
WOS Research Area: Computer Science
WOS Categories: Computer Science, Artificial Intelligence; Computer Science, Theory & Methods
WOS Accession Number: WOS:001000822000030
Publisher: Springer Science and Business Media Deutschland GmbH
EI Accession Number: 20231313823284
EI Main Heading: Classification (of information)
EISSN: 1611-3349
EI Classification Codes: 716.1 Information Theory and Signal Processing; 723.5 Computer Applications; 741.2 Vision; 903.1 Information Sources and Analysis
Original Document Type: Conference article (CA)
Document Type: Conference Paper
Identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/287907
Collection: School of Information Science and Technology_PI Research Groups_Shenghua Gao Group
Corresponding Author: Zhuang, Liansheng
Author Affiliations:
1.Univ Sci & Technol China, Hefei 230026, Peoples R China
2.Peng Cheng Lab, Shenzhen 518000, Peoples R China
3.ShanghaiTech Univ, Shanghai 201210, Peoples R China
Recommended Citation:
GB/T 7714
Li, Li, Zhuang, Liansheng, Gao, Shenghua, et al. HaViT: Hybrid-Attention Based Vision Transformer for Video Classification[C]. Cham, Switzerland: Springer Science and Business Media Deutschland GmbH, 2023: 502-517.
Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.