ShanghaiTech University Knowledge Management System
Evaluating Image Caption via Cycle-consistent Text-to-Image Generation
2025-01-07
Abstract | Evaluating image captions typically relies on reference captions, which are costly to obtain and exhibit significant diversity and subjectivity. While reference-free evaluation metrics have been proposed, most focus on cross-modal evaluation between captions and images. Recent research has revealed that a modality gap generally exists in the representations of contrastive learning-based multi-modal systems, undermining the reliability of cross-modality metrics like CLIPScore. In this paper, we propose CAMScore, a cyclic reference-free automatic evaluation metric for image captioning models. To circumvent this modality gap, CAMScore uses a text-to-image model to generate images from captions and then evaluates these generated images against the original images. Furthermore, to provide fine-grained information for a more comprehensive evaluation, we design a three-level evaluation framework for CAMScore that encompasses pixel-level, semantic-level, and objective-level perspectives. Extensive experimental results across multiple benchmark datasets show that CAMScore achieves a superior correlation with human judgments compared to existing reference-based and reference-free metrics, demonstrating the effectiveness of the framework. |
DOI | arXiv:2501.03567 |
Source | arXiv |
WOS Accession Number | PPRN:120340319 |
WOS Category | Computer Science, Software Engineering |
Document Type | Preprint |
Item Identifier | https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/490278 |
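Collection | School of Information Science and Technology_Master's Students; School of Information Science and Technology_PI Research Groups_Shi Ye Group |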
Corresponding Author | Shi, Ye |
Author Affiliations | 1. ShanghaiTech Univ, Shanghai, Peoples R China; 2. Alibaba Grp, AI Business, Hangzhou, Peoples R China |
Recommended Citation (GB/T 7714) | Cui, Tianyu, Bai, Jinbin, Wang, Guohua, et al. Evaluating Image Caption via Cycle-consistent Text-to-Image Generation. 2025. |
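The cycle described in the abstract (caption to generated image, then generated image compared against the original) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of mean absolute error for the pixel level, and cosine similarity over embeddings for the semantic level are all assumptions made for illustration. The text-to-image model and the image encoder are passed in as hypothetical callables (`generate`, `embed`), images are represented as flattened lists of intensities in [0, 1], and the third level named in the abstract is omitted because the abstract does not describe how it is computed.

```python
import math
from typing import Callable, Dict, Sequence

# Assumption: an image is a flattened sequence of pixel intensities in [0, 1].
Image = Sequence[float]


def pixel_similarity(original: Image, generated: Image) -> float:
    """Pixel-level score: 1 minus the mean absolute error (illustrative choice)."""
    if len(original) == 0 or len(original) != len(generated):
        raise ValueError("images must be non-empty and have the same size")
    mae = sum(abs(a - b) for a, b in zip(original, generated)) / len(original)
    return 1.0 - mae


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Semantic-level score: cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cycle_scores(
    caption: str,
    original: Image,
    generate: Callable[[str], Image],          # hypothetical text-to-image model
    embed: Callable[[Image], Sequence[float]],  # hypothetical image encoder
) -> Dict[str, float]:
    """Score a caption by regenerating an image from it and comparing images.

    Both comparisons are image-to-image, so no caption-to-image embedding
    comparison is involved.
    """
    generated = generate(caption)
    return {
        "pixel": pixel_similarity(original, generated),
        "semantic": cosine_similarity(embed(original), embed(generated)),
    }
```

In use, `generate` would wrap a real text-to-image model and `embed` a real image encoder; the sketch only fixes the shape of the cycle, in which a candidate caption is judged by how closely the image it produces matches the image it was meant to describe.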