Learning Video Representations without Natural Videos

doi:arXiv:2410.24213

	Yu, Xueyang 1; Chen, Xinlei 2; Gandelsman, Yossi 3
	2024-10-31
Status	Published
Abstract	In this paper, we show that useful video representations can be learned from synthetic videos and natural images, without incorporating natural videos in training. We propose a progression of video datasets synthesized by simple generative processes that model a growing set of natural video properties (e.g., motion, acceleration, and shape transformations). The downstream performance of video models pre-trained on these generated datasets gradually increases with the dataset progression. A VideoMAE model pre-trained on our synthetic videos closes 97.2% of the performance gap on UCF101 action classification between training from scratch and self-supervised pre-training from natural videos, and outperforms the pre-trained model on HMDB51. Introducing crops of static images to the pre-training stage results in performance similar to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance. Our approach provides a more controllable and transparent alternative to video data curation processes for pre-training.
Language	English
DOI	arXiv:2410.24213
Source	arXiv
Indexed by	PPRN.PPRN
WOS accession number	PPRN:118937512
WOS category	Computer Science, Software Engineering
Document type	Preprint
Item identifier	https://kms.shanghaitech.edu.cn/handle/2MSLDSTB/458364
Collection	School of Information Science and Technology_Undergraduate Students
Corresponding author	Yu, Xueyang
Author affiliations	1. ShanghaiTech Univ, Shanghai, Peoples R China; 2. Meta AI, San Francisco, CA, USA; 3. Univ Calif Berkeley, Berkeley, CA, USA
Recommended citation (GB/T 7714)	Yu, Xueyang, Chen, Xinlei, Gandelsman, Yossi. Learning Video Representations without Natural Videos. 2024.