HaViT: Hybrid-Attention Based Vision Transformer for Video Classification

doi:10.1007/978-3-031-26316-3_30

	Li, Li 1; Zhuang, Liansheng 1; Gao, Shenghua 3; Wang, Shafei 2
	2023
Proceedings title	LECTURE NOTES IN COMPUTER SCIENCE (INCLUDING SUBSERIES LECTURE NOTES IN ARTIFICIAL INTELLIGENCE AND LECTURE NOTES IN BIOINFORMATICS)
ISSN	0302-9743
Volume	13844 LNCS
Pages	502-517
Publication status	Published
DOI	10.1007/978-3-031-26316-3_30
Abstract	Video transformers have become a promising tool for video classification due to their great success in modeling long-range interactions through the self-attention operation. However, existing transformer models exploit only the patch dependencies within a video when computing self-attention, while ignoring the patch dependencies across different videos. This paper argues that external patch prior information is beneficial to the performance of video transformer models for video classification. Motivated by this assumption, this paper proposes a novel Hybrid-attention based Vision Transformer (HaViT) model for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. Unlike existing self-attention, the hybrid-attention is computed from internal patch tokens and an external patch token dictionary, which encodes external patch prior information across different videos. Experiments on Kinetics-400, Kinetics-600, and Something-Something v2 show that the HaViT model achieves state-of-the-art performance in the video classification task against existing methods. Moreover, experiments show that the proposed hybrid-attention scheme can be integrated into existing video transformer models to improve their performance. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
Keywords	Computer vision ; Classification tasks ; Long range interactions ; Longer-range interaction ; Performance ; Prior information ; State-of-the-art performance ; Transformer modeling ; Video classification
Conference name	16th Asian Conference on Computer Vision, ACCV 2022
Place of publication	GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND
Conference location	Macao, China
Conference date	December 4, 2022 - December 8, 2022
Indexed by	EI ; CPCI-S
Language	English
Funding project	NSFC[
WOS research area	Computer Science
WOS category	Computer Science, Artificial Intelligence ; Computer Science, Theory & Methods
WOS accession number	WOS:001000822000030
Publisher	Springer Science and Business Media Deutschland GmbH
EI accession number	20231313823284
EI main heading	Classification (of information)
EISSN	1611-3349
EI classification code	716.1 Information Theory and Signal Processing ; 723.5 Computer Applications ; 741.2 Vision ; 903.1 Information Sources and Analysis
Original document type	Conference article (CA)
Document type	Conference paper
Item identifier	https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/287907
Collection	School of Information Science and Technology_PI Research Groups_Shenghua Gao Group
Corresponding author	Zhuang, Liansheng
Author affiliations	1. Univ Sci & Technol China, Hefei 230026, Peoples R China ; 2. Peng Cheng Lab, Shenzhen 518000, Peoples R China ; 3. ShanghaiTech Univ, Shanghai 201210, Peoples R China
Recommended citation (GB/T 7714)	Li, Li,Zhuang, Liansheng,Gao, Shenghua,et al. HaViT: Hybrid-Attention Based Vision Transformer for Video Classification[C]. GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND:Springer Science and Business Media Deutschland GmbH,2023:502-517.
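
Illustrative note: the abstract describes hybrid-attention as attention computed from a video's internal patch tokens together with an external patch token dictionary. The record gives no further detail, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the dictionary entries are simply appended to the internal tokens as extra keys and values, and it omits learned query/key/value projections. The names `hybrid_attention`, `tokens`, and `dictionary` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_attention(tokens, dictionary):
    """Attend from each internal token to internal tokens plus dictionary entries.

    tokens:     list of N vectors (patch tokens of one video).
    dictionary: list of M vectors (assumed external patch token dictionary,
                shared across videos; in a real model it would be learned).
    Returns (outputs, weights): N output vectors and N rows of N + M weights.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    # Assumption: keys and values are the internal tokens followed by the
    # external dictionary entries; no learned projections are applied.
    memory = tokens + dictionary
    outputs, weights = [], []
    for q in tokens:
        # Scaled dot-product score of this query against every memory entry.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in memory]
        w = softmax(scores)
        # Weighted sum of memory entries, one output dimension at a time.
        out = [sum(w[j] * memory[j][d] for j in range(len(memory)))
               for d in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights
```

Under this assumption, each row of `weights` has `N + M` entries: the first `N` are the usual within-video attention weights and the last `M` show how much a patch draws on the external dictionary.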