ShanghaiTech University Knowledge Management System
An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training
2025-02-01
Journal | INTERNATIONAL JOURNAL OF COMPUTER VISION (IF: 11.6 [JCR-2023], 14.5 [5-Year])
ISSN | 0920-5691 |
EISSN | 1573-1405 |
Publication Status | Published
DOI | 10.1007/s11263-024-02327-w |
Abstract | Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question whether the fine-tuning performance of extremely simple lightweight ViTs can also benefit from this pre-training paradigm, which is considerably less studied than the well-established lightweight architecture design methodology. We use an observation-analysis-solution flow for our study. We first systematically observe different behaviors among the evaluated pre-training methods with respect to the downstream fine-tuning data scales. Furthermore, we analyze the layer representation similarities and attention maps across the obtained models, which clearly show the inferior learning of MIM pre-training on higher layers, leading to unsatisfactory transfer performance on data-insufficient downstream tasks. This finding naturally guides the design of our distillation strategies during pre-training to solve the above deterioration problem. Extensive experiments have demonstrated the effectiveness of our approach. Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K segmentation task (42.8% mIoU) and the LaSOT tracking task (66.1% AUC) in the lightweight regime. The latter even surpasses all current SOTA lightweight CPU-realtime trackers.
Keywords | Masked image modeling; Vision transformers; Lightweight networks; Knowledge distillation; Visual tracking; Image representation; Machine vision; Photointerpretation; Downstream; Fine-tuning; Performance; Pre-training
URL | View Full Text
Indexed By | SCI; EI
Language | English
Funding Projects | Beijing Natural Science Foundation ["JQ22014", "L223003", "4234087"]; Natural Science Foundation of China ["U22B2056", "62422317", "62192782", "62036011", "U2033210", "62102417", "62222206"]; Project of Beijing Science and Technology Committee [Z231100005923046]
WOS Research Area | Computer Science
WOS Category | Computer Science, Artificial Intelligence
WOS Record Number | WOS:001420026500001
Publisher | SPRINGER
EI Accession Number | 20250717880186
EI Subject Terms | Image segmentation
EI Classification Codes | 101.5 Ergonomics and Human Factors Engineering; 1106.3.1 Image Processing; 1106.8 Computer Vision; 741.2 Vision; 742.1 Photography
Original Document Type | Article in Press
Document Type | Journal Article
Item Identifier | https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/372903
Collection | School of Information Science and Technology
Corresponding Author | Wang, Shaoru
Author Affiliations | 1. Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence S, Beijing 100190, Peoples R China; 2. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 101408, Peoples R China; 3. ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai 201210, Peoples R China; 4. Megvii Res, Beijing 100089, Peoples R China; 5. Beijing Inst Basic Med Sci, Beijing 100850, Peoples R China; 6. Nanchang Hangkong Univ, Nanchang 330063, Peoples R China; 7. Zhejiang Univ Technol, Hangzhou 310014, Peoples R China
Recommended Citation (GB/T 7714) | Gao, Jin, Lin, Shubo, Wang, Shaoru, et al. An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training[J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025.
APA | Gao, Jin, Lin, Shubo, Wang, Shaoru, Kou, Yutong, Li, Zeming, ... & Hu, Weiming. (2025). An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training. INTERNATIONAL JOURNAL OF COMPUTER VISION.
MLA | Gao, Jin, et al. "An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training". INTERNATIONAL JOURNAL OF COMPUTER VISION (2025).
Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.