Dynamic Grained Encoder for Vision Transformers
2021
Proceedings title: ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS
ISSN: 1049-5258
Volume: 7
Pages: 5770-5783
Publication status: Published
DOI: Not available
Abstract: Transformers, the de-facto standard for language modeling, have been recently applied for vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save computational costs. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region. Thus it achieves a fine-grained representation in discriminative regions while keeping high efficiency. Besides, the dynamic grained encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. Code is available at https://github.com/StevenGrove/vtpack. © 2021 Neural information processing systems foundation. All rights reserved.
Keywords: Modeling languages; Object detection; Computational costs; De facto standard; Fine grained; Higher efficiency; Language model; Natural images; Performance; Spatial redundancy; Spatial regions; State of the art
Conference name: 35th Conference on Neural Information Processing Systems, NeurIPS 2021
Conference location: Virtual, Online
Conference date: December 6, 2021 - December 14, 2021
Indexed by: EI
Language: English
Publisher: Neural information processing systems foundation
EI accession number: 20222412223132
EI main heading: Signal encoding
EI classification codes: 716.1 Information Theory and Signal Processing ; 723.2 Data Processing and Image Processing
Original document type: Conference article (CA)
Document type: Conference paper
Item identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/251938
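Collections: School of Information Science and Technology_PI Research Groups_Xuming He Group
School of Information Science and Technology_PhD Students
Author affiliations: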
1.College of Artificial Intelligence, Xi'an Jiaotong University, China;
2.ShanghaiTech University, China;
3.Megvii Inc. (Face++);
4.University of Chinese Academy of Sciences, China;
5.Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, China
Recommended citation
GB/T 7714
Song, Lin,Zhang, Songyang,Liu, Songtao,et al. Dynamic Grained Encoder for Vision Transformers[C]:Neural information processing systems foundation,2021:5770-5783.