Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
2024-01-04
Proceedings title: ARXIV
ISSN: 2159-5399
Publication status: Published
DOI: arXiv:2401.02347
Abstract

Image captioning aims at generating descriptive and meaningful textual descriptions of images, enabling a broad range of vision-language applications. Prior works have demonstrated that harnessing the power of Contrastive Language-Image Pre-training (CLIP) offers a promising approach to achieving zero-shot captioning, eliminating the need for expensive caption annotations. However, the widely observed modality gap in the latent space of CLIP harms the performance of zero-shot captioning by breaking the alignment between paired image-text features. To address this issue, we conduct an analysis of the CLIP latent space that leads to two findings. First, we observe that CLIP's visual features of image subregions can achieve closer proximity to the paired caption, owing to the inherent information loss in text descriptions. Second, we show that the modality gap between a paired image and text can be empirically modeled as a zero-mean Gaussian distribution. Motivated by these findings, we propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap. In particular, we introduce a subregion feature aggregation that leverages local region information to produce a compact visual representation for matching the text representation. Moreover, we incorporate a noise-injection and CLIP-reranking strategy to boost captioning performance. We also extend our framework to build a zero-shot VQA pipeline, demonstrating its generality. Through extensive experiments on common captioning and VQA datasets such as MSCOCO, Flickr30k, and VQAv2, we show that our method achieves remarkable performance improvements. Code is available at https://github.com/Artanic30/MacCap.
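The abstract's Gaussian modality-gap finding suggests a simple text-only training recipe: if paired image and text features differ by roughly zero-mean Gaussian noise, a caption decoder trained on noise-perturbed text embeddings should transfer to image embeddings at inference. The sketch below illustrates only that perturb-and-renormalize step in NumPy; the function name, the noise scale `sigma`, and the toy embedding are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def inject_modality_noise(text_emb, sigma=0.016, rng=None):
    """Perturb a CLIP-style text embedding with zero-mean Gaussian noise
    (approximating the paired image embedding), then re-normalize onto
    the unit hypersphere where CLIP features live."""
    rng = rng if rng is not None else np.random.default_rng(0)
    noisy = text_emb + rng.normal(0.0, sigma, size=text_emb.shape)
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)

# Toy usage: a unit-norm 512-dim vector standing in for a text embedding.
emb = np.ones(512) / np.sqrt(512)
noisy_emb = inject_modality_noise(emb)
# The noise only nudges the feature, so cosine similarity stays high.
cos = float(emb @ noisy_emb)
```

During training, a fresh noise sample per caption would expose the decoder to the whole neighborhood of each text feature, which is what lets it tolerate the image-side shift at test time.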

Conference: 38th AAAI Conference on Artificial Intelligence (AAAI) / 36th Conference on Innovative Applications of Artificial Intelligence / 14th Symposium on Educational Advances in Artificial Intelligence
Place of publication: 2275 E BAYSHORE RD, STE 160, PALO ALTO, CA 94303 USA
Conference location: Vancouver, Canada
Conference dates: FEB 20-27, 2024
Indexed in: CPCI-S
Language: English
Funding: Shanghai Science and Technology Program[
WOS research area: Computer Science
WOS categories: Computer Science, Artificial Intelligence ; Computer Science, Software Engineering
WOS accession number: PPRN:86970833
Publisher: ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE
EISSN: 2374-3468
Document type: Conference paper
Item identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/381276
Collections: School of Information Science and Technology_Master's Students
School of Information Science and Technology_PI Research Groups_He Xuming Group
School of Information Science and Technology_PhD Students
Corresponding author: Qiu, Longtian
Author affiliations:
1.ShanghaiTech Univ, Shanghai, Peoples R China
2.Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
First author's affiliation: ShanghaiTech University
Corresponding author's affiliation: ShanghaiTech University
First author's primary affiliation: ShanghaiTech University
Recommended citation (GB/T 7714):
Qiu, Longtian,Ning, Shan,He, Xuming. Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training[C]. 2275 E BAYSHORE RD, STE 160, PALO ALTO, CA 94303 USA:ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE,2024.
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.