ArCo: Attention-reinforced transformer with contrastive learning for image captioning

doi:10.1016/j.imavis.2022.104570

	Wang, Zhongan; Shi, Shuai; Zhai, Zirong; Wu, Yingna; Yang, Rui
	2022-12
Journal	IMAGE AND VISION COMPUTING (IF:4.2[JCR-2023],4.3[5-Year])
ISSN	0262-8856
EISSN	1872-8138
Volume	128
Publication status	Published
DOI	10.1016/j.imavis.2022.104570
Abstract	Image captioning is a significant step toward achieving automatic interactions between humans and computers, in which a textual sequence of the content of an image is generated. Recently, the transformer-based encoder–decoder paradigm has made great achievements in image captioning. This method is usually trained with a cross-entropy loss function. However, for various captions of images with the same meaning, the computed losses may be different. The result is that the descriptions of images tend to be consistent, which limits the diversity of image captioning. In this paper, we present an attention-reinforced transformer, a transformer-based architecture for image captioning. The architecture improves the image encoding stage, which exploits the relationships between image regions by integrating a feature attention block (FAB). During the training phase, we trained the model with a combination of cross-entropy loss and contrastive loss. We experimentally explored the performance of ArCo and other fully attentive models. We also validated the baseline of the transformer for image captioning with different pre-trained models. Our proposed approach was demonstrated to achieve a new state-of-the-art performance on the offline ‘Karpathy’ test split and online test server.
Keywords	Contrastive learning ; Image captioning ; Transformer ; Visual attention
Indexed by	EI ; SCI
Language	English
Funding project	CAS Interdisciplinary Innovation Team Project[JCTD-2020-10]
WOS research areas	Computer Science ; Engineering ; Optics
WOS categories	Computer Science, Artificial Intelligence ; Computer Science, Software Engineering ; Computer Science, Theory & Methods ; Engineering, Electrical & Electronic ; Optics
WOS accession number	WOS:000891904500002
Publisher	ELSEVIER
Scopus record ID	2-s2.0-85141928664
Source database	Scopus
Document type	Journal article
Item identifier	https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/251413
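Collections	School of Creativity and Art_PI Research Group (P)_Wu Yingna Group ; School of Physical Science and Technology_Master's Students ; School of Information Science and Technology_Master's Students ; School of Creativity and Art_PI Research Group (P)_Zhai Zirong Group ; School of Creativity and Art_PI Research Group (P)_Yang Rui Group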
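Corresponding author	Zhai, Zirong
Author affiliation	ShanghaiTech Univ, Shanghai 201210, Peoples R China
First author affiliation	ShanghaiTech University
Corresponding author affiliation	ShanghaiTech University
First affiliation of the first author	ShanghaiTech University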
Recommended citation
GB/T 7714	Wang, Zhongan,Shi, Shuai,Zhai, Zirong,et al. ArCo: Attention-reinforced transformer with contrastive learning for image captioning[J]. IMAGE AND VISION COMPUTING,2022,128.
APA	Wang, Zhongan,Shi, Shuai,Zhai, Zirong,Wu, Yingna,&Yang, Rui.(2022).ArCo: Attention-reinforced transformer with contrastive learning for image captioning.IMAGE AND VISION COMPUTING,128.
MLA	Wang, Zhongan,et al."ArCo: Attention-reinforced transformer with contrastive learning for image captioning".IMAGE AND VISION COMPUTING 128(2022).