Joint Video and Text Parsing for Understanding Events and Answering Queries

doi:10.1109/MMUL.2014.29

	Kewei Tu [1] ; Meng Meng [2] ; Mun Wai Lee [3] ; Tae Eun Choe [4] ; Song-Chun Zhu [2]
	2014-04-01
Journal	IEEE MULTIMEDIA (IF: 2.3 [JCR-2023], 3.0 [5-Year])
ISSN	1070-986X
Volume	21 Issue: 2 Pages: 42-70
Publication status	Published
DOI	10.1109/MMUL.2014.29
Abstract	This article proposes a multimedia analysis framework to process video and text jointly for understanding events and answering user queries. The framework produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events), and causal information (causalities between events and fluents) in the video and text. The knowledge representation of the framework is based on a spatial-temporal-causal AND-OR graph (S/T/C-AOG), which jointly models possible hierarchical compositions of objects, scenes, and events as well as their interactions and mutual contexts, and specifies the prior probabilistic distribution of the parse graphs. The authors present a probabilistic generative model for joint parsing that captures the relations between the input video/text, their corresponding parse graphs, and the joint parse graph. Based on the probabilistic model, the authors propose a joint parsing system consisting of three modules: video parsing, text parsing, and joint inference. Video parsing and text parsing produce two parse graphs from the input video and text, respectively. The joint inference module produces a joint parse graph by performing matching, deduction, and revision on the video and text parse graphs. The proposed framework has the following objectives: to provide deep semantic parsing of video and text that goes beyond the traditional bag-of-words approaches; to perform parsing and reasoning across the spatial, temporal, and causal dimensions based on the joint S/T/C-AOG representation; and to show that deep joint parsing facilitates subsequent applications such as generating narrative text descriptions and answering queries in the forms of who, what, when, where, and why. The authors empirically evaluated the system based on comparison against ground truth as well as accuracy of query answering and obtained satisfactory results.
Keywords	Text recognition ; Semantics ; Computer vision ; Multimedia communication ; Streaming media ; Probabilistic logic ; Computational modeling
Indexed by	SCI ; EI
Language	English
Funding	US National Science Foundation Cyber-Enabled Discovery and Innovation (CDI) grant Computer and Network Systems (CNS)[1028381]
WOS research area	Computer Science
WOS categories	Computer Science, Hardware & Architecture ; Computer Science, Information Systems ; Computer Science, Software Engineering ; Computer Science, Theory & Methods
WOS accession number	WOS:000337168900005
Publisher	IEEE COMPUTER SOC
EI accession number	20142317782075
EI subject terms	Graphic methods ; Knowledge representation ; Probability distributions ; Semantics
EI classification codes	Computer Software, Data Handling and Applications:723 ; Information Science:903 ; Probability Theory:922.1
WOS keywords	IMAGE
Original document type	Article
Source database	IEEE
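Document type	Journal article
Item identifier	https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/2413
Collection	School of Information Science and Technology_PI Research Groups_Kewei Tu Group
Author affiliations	1. ShanghaiTech University, China 2. University of California, Los Angeles 3. Intelligent Automation 4. ObjectVideo
First author affiliation	ShanghaiTech University
First affiliation of the first author	ShanghaiTech University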
Recommended citation (GB/T 7714)	Kewei Tu, Meng Meng, Mun Wai Lee, et al. Joint Video and Text Parsing for Understanding Events and Answering Queries[J]. IEEE MULTIMEDIA, 2014, 21(2): 42-70.
APA	Kewei Tu, Meng Meng, Mun Wai Lee, Tae Eun Choe, & Song-Chun Zhu. (2014). Joint Video and Text Parsing for Understanding Events and Answering Queries. IEEE MULTIMEDIA, 21(2), 42-70.
MLA	Kewei Tu, et al. "Joint Video and Text Parsing for Understanding Events and Answering Queries". IEEE MULTIMEDIA 21.2 (2014): 42-70.