ShanghaiTech University Knowledge Management System
HaViT: Hybrid-Attention Based Vision Transformer for Video Classification | |
2023 | |
Proceedings title | LECTURE NOTES IN COMPUTER SCIENCE (INCLUDING SUBSERIES LECTURE NOTES IN ARTIFICIAL INTELLIGENCE AND LECTURE NOTES IN BIOINFORMATICS)
ISSN | 0302-9743 |
Volume | 13844 LNCS |
Pages | 502-517 |
Publication status | Published |
DOI | 10.1007/978-3-031-26316-3_30 |
Abstract | Video transformers have become a promising tool for video classification due to their great success in modeling long-range interactions through the self-attention operation. However, existing transformer models only exploit the patch dependencies within a video when doing self-attention, while ignoring the patch dependencies across different videos. This paper argues that external patch prior information is beneficial to the performance of video transformer models for video classification. Motivated by this assumption, this paper proposes a novel Hybrid-attention based Vision Transformer (HaViT) model for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. Unlike existing self-attention, the hybrid-attention is computed from internal patch tokens and an external patch token dictionary, which encodes external patch prior information across different videos. Experiments on Kinetics-400, Kinetics-600, and Something-Something v2 show that our HaViT model achieves state-of-the-art video classification performance compared with existing methods. Moreover, experiments show that our proposed hybrid-attention scheme can be integrated into existing video transformer models to improve their performance. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG. |
Keywords | Computer vision ; Classification tasks ; Long range interactions ; Longer-range interaction ; Performance ; Prior information ; State-of-the-art performance ; Transformer modeling ; Video classification |
Conference name | 16th Asian Conference on Computer Vision, ACCV 2022 |
Place of publication | GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND |
Conference location | Macao, China |
Conference dates | December 4, 2022 - December 8, 2022 |
Indexed by | EI ; CPCI-S |
Language | English |
Funding project | NSFC[ |
WOS research area | Computer Science |
WOS category | Computer Science, Artificial Intelligence ; Computer Science, Theory & Methods |
WOS accession number | WOS:001000822000030 |
Publisher | Springer Science and Business Media Deutschland GmbH |
EI accession number | 20231313823284 |
EI main heading | Classification (of information) |
EISSN | 1611-3349 |
EI classification codes | 716.1 Information Theory and Signal Processing ; 723.5 Computer Applications ; 741.2 Vision ; 903.1 Information Sources and Analysis |
Original document type | Conference article (CA) |
Document type | Conference paper |
Item identifier | https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/287907 |
Collection | School of Information Science and Technology_PI Research Groups_Gao Shenghua Group |
Corresponding author | Zhuang, Liansheng |
Author affiliations | 1.Univ Sci & Technol China, Hefei 230026, Peoples R China 2.Peng Cheng Lab, Shenzhen 518000, Peoples R China 3.ShanghaiTech Univ, Shanghai 201210, Peoples R China |
Recommended citation (GB/T 7714) | Li, Li,Zhuang, Liansheng,Gao, Shenghua,et al. HaViT: Hybrid-Attention Based Vision Transformer for Video Classification[C]. GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND:Springer Science and Business Media Deutschland GmbH,2023:502-517. |