An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training
2025-02-01
Journal: INTERNATIONAL JOURNAL OF COMPUTER VISION (IF: 11.6 [JCR-2023], 14.5 [5-Year])
ISSN: 0920-5691
EISSN: 1573-1405
Publication Status: Published
DOI: 10.1007/s11263-024-02327-w
Abstract: Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question whether the fine-tuning performance of extremely simple lightweight ViTs can also benefit from this pre-training paradigm, which is considerably less studied in contrast to the well-established lightweight architecture design methodology. We use an observation-analysis-solution flow for our study. We first systematically observe different behaviors among the evaluated pre-training methods with respect to the downstream fine-tuning data scales. Furthermore, we analyze the layer representation similarities and attention maps across the obtained models, which clearly show the inferior learning of MIM pre-training on higher layers, leading to unsatisfactory transfer performance on data-insufficient downstream tasks. This finding naturally guides the design of our distillation strategies during pre-training, which solve the above deterioration problem. Extensive experiments have demonstrated the effectiveness of our approach. Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M) can achieve 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K segmentation task (42.8% mIoU) and the LaSOT tracking task (66.1% AUC) in the lightweight regime. The latter even surpasses all current SOTA lightweight CPU-realtime trackers.
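The abstract's key technical idea, MIM pre-training combined with feature distillation from a stronger teacher, can be illustrated with a minimal PyTorch-style sketch. This is an assumption-laden illustration of the general technique, not the paper's actual method: the names TinyViT, mae_mask, and pretrain_step are hypothetical, positional embeddings are omitted, the sketch regresses pixels of the visible patches only (a full MAE decoder would reconstruct the masked ones), and student and teacher are assumed to share the same feature dimension.

```python
# Hedged sketch: MAE-style masked image modeling with an auxiliary
# feature-distillation loss on the encoder's output (higher-layer) features.
# All names here are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyViT(nn.Module):
    """Toy stand-in for a lightweight ViT encoder (patch embed + blocks)."""
    def __init__(self, dim=192, depth=4, heads=3):
        super().__init__()
        self.patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def embed(self, imgs):                        # (B, 3, H, W) -> (B, N, D)
        return self.patch(imgs).flatten(2).transpose(1, 2)

    def forward(self, tokens):                    # run the transformer blocks
        return self.blocks(tokens)

def mae_mask(tokens, ratio=0.75):
    """Randomly keep (1 - ratio) of the patch tokens, MAE-style."""
    B, N, D = tokens.shape
    keep = int(N * (1 - ratio))
    ids = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :keep]
    vis = torch.gather(tokens, 1, ids.unsqueeze(-1).expand(-1, -1, D))
    return vis, ids

def pretrain_step(student, teacher, pixel_head, imgs, w_distill=1.0):
    """One pre-training step: pixel regression + teacher-feature matching.
    Simplification: pixels of the *visible* patches are regressed; a full
    MAE decoder would instead reconstruct the *masked* ones."""
    vis, ids = mae_mask(student.embed(imgs))
    feats = student(vis)                          # (B, keep, D)

    # (1) MIM term: regress raw pixels of the selected 16x16 patches.
    patches = F.unfold(imgs, kernel_size=16, stride=16).transpose(1, 2)
    target = torch.gather(patches, 1,
                          ids.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
    loss_mim = F.mse_loss(pixel_head(feats), target)

    # (2) Distillation term: match the frozen teacher's last-layer features
    # at the same token positions (assumes matching feature dimensions).
    with torch.no_grad():
        t_feats = teacher(teacher.embed(imgs))
        t_vis = torch.gather(t_feats, 1,
                             ids.unsqueeze(-1).expand(-1, -1, t_feats.size(-1)))
    loss_kd = F.smooth_l1_loss(feats, t_vis)
    return loss_mim + w_distill * loss_kd

# Usage on random data (the teacher would normally be a larger, already
# pre-trained model, kept frozen during the student's pre-training):
student, teacher = TinyViT(depth=4), TinyViT(depth=8).eval()
pixel_head = nn.Linear(192, 16 * 16 * 3)          # predict one patch's pixels
loss = pretrain_step(student, teacher, pixel_head, torch.randn(2, 3, 224, 224))
loss.backward()
```

The distillation term targets the teacher's last-layer features because, per the abstract, MIM pre-training learns higher layers poorly; supervising those layers directly is one way such a deterioration could be addressed.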
Keywords: Masked image modeling; Vision transformers; Lightweight networks; Knowledge distillation; Visual tracking; Image representation; Machine vision; Photointerpretation; Vision; Downstream; Fine tuning; Image modeling; Performance; Pre-training
Indexed By: SCI; EI
Language: English
Funding Project: Beijing Natural Science Foundation [JQ22014, L223003, 4234087]; Natural Science Foundation of China [U22B2056, 62422317, 62192782, 62036011, U2033210, 62102417, 62222206]; Project of Beijing Science and Technology Committee [Z231100005923046]
WOS Research Area: Computer Science
WOS Subject: Computer Science, Artificial Intelligence
WOS Record Number: WOS:001420026500001
Publisher: SPRINGER
EI Accession Number: 20250717880186
EI Keywords: Image segmentation
EI Classification Number: 101.5 Ergonomics and Human Factors Engineering; 1106.3.1 Image Processing; 1106.8 Computer Vision; 741.2 Vision; 742.1 Photography
Original Document Type: Article in Press
Document Type: Journal article
Identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/372903
Collection: School of Information Science and Technology
Corresponding Author: Wang, Shaoru
Affiliations:
1.Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing 100190, Peoples R China
2.Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 101408, Peoples R China
3.ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai 201210, Peoples R China
4.Megvii Res, Beijing 100089, Peoples R China
5.Beijing Inst Basic Med Sci, Beijing 100850, Peoples R China
6.Nanchang Hangkong Univ, Nanchang 330063, Peoples R China
7.Zhejiang Univ Technol, Hangzhou 310014, Peoples R China
Recommended Citation:
GB/T 7714
Gao, Jin, Lin, Shubo, Wang, Shaoru, et al. An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training[J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025.
APA Gao, Jin., Lin, Shubo., Wang, Shaoru., Kou, Yutong., Li, Zeming., ... & Hu, Weiming. (2025). An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training. INTERNATIONAL JOURNAL OF COMPUTER VISION.
MLA Gao, Jin, et al. "An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training". INTERNATIONAL JOURNAL OF COMPUTER VISION (2025).