Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
2024-01-04
Proceedings title: ARXIV
ISSN: 2159-5399
Publication status: Published
DOI: arXiv:2401.02347
Abstract

Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Language-Image Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis of the CLIP latent space that leads to two findings. First, we observe that CLIP's visual features of image subregions can achieve closer proximity to the paired caption, owing to the inherent information loss in text descriptions. Second, we show that the modality gap between a paired image and text can be empirically modeled as a zero-mean Gaussian distribution. Motivated by these findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation that leverages local region information to produce a compact visual representation for matching the text representation. Moreover, we incorporate a noise-injection and CLIP-reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k, and VQAv2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap.
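The abstract's Gaussian modality-gap finding suggests a simple text-only training recipe: if paired image and text features differ by roughly zero-mean Gaussian noise, a caption decoder trained on noise-perturbed text embeddings should transfer to image embeddings at inference. The sketch below illustrates only that perturb-and-renormalize step in NumPy; the function name, the noise scale `sigma`, and the toy embedding are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def inject_modality_noise(text_emb, sigma=0.016, rng=None):
    """Perturb a CLIP-style text embedding with zero-mean Gaussian noise
    (approximating the paired image embedding), then re-normalize onto
    the unit hypersphere where CLIP features live."""
    rng = rng if rng is not None else np.random.default_rng(0)
    noisy = text_emb + rng.normal(0.0, sigma, size=text_emb.shape)
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)

# Toy usage: a unit-norm 512-dim vector standing in for a text embedding.
emb = np.ones(512) / np.sqrt(512)
noisy_emb = inject_modality_noise(emb)
# The noise only nudges the feature, so cosine similarity stays high.
cos = float(emb @ noisy_emb)
```

During training, a fresh noise sample per caption would expose the decoder to the whole neighborhood of each text feature, which is what lets it tolerate the image-side shift at test time.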

Conference: 38th AAAI Conference on Artificial Intelligence (AAAI) / 36th Conference on Innovative Applications of Artificial Intelligence / 14th Symposium on Educational Advances in Artificial Intelligence
Place of publication: 2275 E BAYSHORE RD, STE 160, PALO ALTO, CA 94303 USA
Conference location: Vancouver, Canada
Conference dates: FEB 20-27, 2024
Indexed in: CPCI-S
Language: English
Funding: Shanghai Science and Technology Program[
WOS research area: Computer Science
WOS categories: Computer Science, Artificial Intelligence ; Computer Science, Software Engineering
WOS accession number: PPRN:86970833
Publisher: ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE
EISSN: 2374-3468
Document type: Conference paper
Item identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/381276
Collections: School of Information Science and Technology_Master's Students
School of Information Science and Technology_PI Research Groups_He Xuming Group
School of Information Science and Technology_PhD Students
Corresponding author: Qiu, Longtian
Author affiliations:
1.ShanghaiTech Univ, Shanghai, Peoples R China
2.Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
First author's affiliation: ShanghaiTech University
Corresponding author's affiliation: ShanghaiTech University
First author's primary affiliation: ShanghaiTech University
Recommended citation (GB/T 7714):
Qiu, Longtian,Ning, Shan,He, Xuming. Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training[C]. 2275 E BAYSHORE RD, STE 160, PALO ALTO, CA 94303 USA:ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE,2024.
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.