ShanghaiTech University Knowledge Management System
ArCo: Attention-reinforced transformer with contrastive learning for image captioning
2022-12
Journal | IMAGE AND VISION COMPUTING (IF: 4.2 [JCR-2023], 4.3 [5-Year]) |
ISSN | 0262-8856 |
EISSN | 1872-8138 |
Volume | 128 |
Publication status | Published |
DOI | 10.1016/j.imavis.2022.104570 |
Abstract | Image captioning is a significant step toward achieving automatic interactions between humans and computers, in which a textual sequence of the content of an image is generated. Recently, the transformer-based encoder–decoder paradigm has made great achievements in image captioning. This method is usually trained with a cross-entropy loss function. However, for various captions of images with the same meaning, the computed losses may be different. The result is that the descriptions of images tend to be consistent, which limits the diversity of image captioning. In this paper, we present an attention-reinforced transformer, a transformer-based architecture for image captioning. The architecture improves the image encoding stage, which exploits the relationships between image regions by integrating a feature attention block (FAB). During the training phase, we trained the model with a combination of cross-entropy loss and contrastive loss. We experimentally explored the performance of ArCo and other fully attentive models. We also validated the baseline of the transformer for image captioning with different pre-trained models. Our proposed approach was demonstrated to achieve a new state-of-the-art performance on the offline ‘Karpathy’ test split and online test server. |
Keywords | Contrastive learning ; Image captioning ; Transformer ; Visual attention |
Indexed by | EI ; SCI |
Language | English |
Funding project | CAS Interdisciplinary Innovation Team Project[JCTD-2020-10] |
WOS research area | Computer Science ; Engineering ; Optics |
WOS category | Computer Science, Artificial Intelligence ; Computer Science, Software Engineering ; Computer Science, Theory & Methods ; Engineering, Electrical & Electronic ; Optics |
WOS accession number | WOS:000891904500002 |
Publisher | ELSEVIER |
Scopus record number | 2-s2.0-85141928664 |
Source database | Scopus |
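Document type | Journal article |
Item identifier | https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/251413 |
Collections | School of Creativity and Art_PI Research Group (P)_Wu Yingna Group ; School of Physical Science and Technology_Master's Students ; School of Information Science and Technology_Master's Students ; School of Creativity and Art_PI Research Group (P)_Zhai Zirong Group ; School of Creativity and Art_PI Research Group (P)_Yang Rui Group |
Corresponding author | Zhai, Zirong |
Author affiliation | ShanghaiTech Univ, Shanghai 201210, Peoples R China |
First author affiliation | ShanghaiTech University |
Corresponding author affiliation | ShanghaiTech University |
First affiliation of the first author | ShanghaiTech University |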
Recommended citation (GB/T 7714) | Wang, Zhongan, Shi, Shuai, Zhai, Zirong, et al. ArCo: Attention-reinforced transformer with contrastive learning for image captioning[J]. IMAGE AND VISION COMPUTING, 2022, 128. |
APA | Wang, Zhongan, Shi, Shuai, Zhai, Zirong, Wu, Yingna, & Yang, Rui. (2022). ArCo: Attention-reinforced transformer with contrastive learning for image captioning. IMAGE AND VISION COMPUTING, 128. |
MLA | Wang, Zhongan, et al. "ArCo: Attention-reinforced transformer with contrastive learning for image captioning". IMAGE AND VISION COMPUTING 128 (2022). |
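
The abstract states that the model is trained with a combination of cross-entropy loss and contrastive loss, but this record does not give the exact formulation. The following is only a minimal sketch of how such a combined objective could look, under loudly stated assumptions: the contrastive term is taken to be an InfoNCE-style loss over matched image/caption embeddings, and the names `info_nce`, `combined_loss`, `weight`, and `temperature` are hypothetical and not taken from the paper.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-likelihood of the reference caption.

    token_probs[t] is the probability the model assigned to the
    t-th ground-truth token (each value must be in (0, 1]).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(image_embs, caption_embs, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumption, not the paper's stated form).

    Pair (i, i) is treated as the positive; every other caption in the
    batch serves as a negative for image i.
    """
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, cap) / temperature for cap in caption_embs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(image_embs)


def combined_loss(token_probs, image_embs, caption_embs,
                  weight=0.5, temperature=0.1):
    """Cross-entropy plus a weighted contrastive term (hypothetical weighting)."""
    return (cross_entropy(token_probs)
            + weight * info_nce(image_embs, caption_embs, temperature))
```

In this sketch, setting `weight=0` recovers plain cross-entropy training, while a larger `weight` puts more emphasis on pulling each image embedding toward its own caption and away from the other captions in the batch.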