ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition
2022-06-27
Status: Published
Abstract

Multi-modal Named Entity Recognition (MNER) has recently attracted considerable attention. Most existing work uses image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, such interactions are difficult to model because image and text representations are trained separately on data from their respective modalities and are not aligned in the same space. As text representations play the most important role in MNER, in this paper we propose Image-text Alignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first aligns the image into regional object tags, image-level captions, and optical characters as visual contexts, concatenates them with the input text as a new cross-modal input, and then feeds it into a pretrained textual embedding model. This makes it easier for the attention module of the pretrained textual embedding model to model the interaction between the two modalities, since both are represented in the textual space. ITA further aligns the output distributions predicted from the cross-modal input and textual input views, so that the MNER model is more practical for text-only inputs and more robust to noise from images. In our experiments, we show that ITA models can achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information.
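
The two steps described in the abstract (building a cross-modal input in the textual space, then aligning the output distributions of the two input views) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `[SEP]` separator, the per-token softmax, and the choice of KL divergence as the alignment measure are all assumptions made for illustration, and the visual contexts are assumed to have already been produced by an object detector, a captioning model, and an OCR system.

```python
import math


def build_cross_modal_input(tokens, object_tags, caption_tokens, ocr_tokens, sep="[SEP]"):
    """Concatenate the input text with visual contexts expressed as text.

    The separator token and the ordering of the visual contexts are
    illustrative assumptions, not details taken from the abstract.
    """
    visual_context = list(object_tags) + list(caption_tokens) + list(ocr_tokens)
    return list(tokens) + [sep] + visual_context


def softmax(logits):
    """Numerically stable softmax over one token's label scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(cross_modal_logits, text_only_logits):
    """Mean per-token KL between the label distributions of the two views.

    Assumption: both views score the same text tokens, so the two lists
    have the same length and the same label set per token.
    """
    per_token = [
        kl_divergence(softmax(c), softmax(t))
        for c, t in zip(cross_modal_logits, text_only_logits)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    tokens = ["Messi", "joins", "PSG"]
    cross_modal = build_cross_modal_input(
        tokens,
        object_tags=["person", "jersey"],
        caption_tokens=["a", "man", "holding", "a", "shirt"],
        ocr_tokens=["PARIS"],
    )
    print(cross_modal)

    # Hypothetical label scores for the three text tokens under each view.
    cross_logits = [[2.0, 0.1], [0.2, 1.5], [1.8, 0.3]]
    text_logits = [[1.5, 0.4], [0.3, 1.4], [1.0, 0.9]]
    print(alignment_loss(cross_logits, text_logits))
```

In this sketch, a pretrained textual embedding model would consume the concatenated sequence, and the alignment term would be added to the usual tagging loss so that predictions from the text-only view stay close to those from the cross-modal view.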

DOI: arXiv:2112.06482
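Source: arXiv
WOS accession number: PPRN:12230899
WOS category: Computer Science, Interdisciplinary Applications
Funding: National Natural Science Foundation of China [61976139]
Document type: Preprint
Item identifier: https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/348203
Collections: School of Information Science and Technology
School of Information Science and Technology_PI Research Groups_Kewei Tu Group
School of Information Science and Technology_PhD Students
Author affiliations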
1.ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai, Peoples R China
2.Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
3.Alibaba Grp, DAMO Acad, Hangzhou, Peoples R China
4.Shopee, Singapore, Singapore
5.Microsoft, Redmond, WA 98052, USA
Recommended citation
GB/T 7714
Wang, Xinyu,Gui, Min,Jiang, Yong,et al. ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition. 2022.