VideoMAE
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pretraining[1]
作者是Zhan Tong, Yibing Song, Jue Wang 和王利民,分别来自南大,腾讯和上海AI Lab,论文引用[1]:Tong, Zhan et al. “VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training.” ArXiv abs/2203.12602 (2022): n. pag.
Time
Key Words
- video masked autoencoder using plain ViT backbones, tube masking with high ratio
- data-efficient learner that could be successfully trained with only 3.5k videos. Data quality more important than quantity for SSVP(self-supervised video pretraining) when a domain shift exists between source and target dataset.
动机
对于Video Transformers,通常是derived from 基于图像的transformer,严重依赖于从大规模图像数据的pre-trained models,高效地训练一个vanilla vision transformer on the video dataset without any pre-trianed model or extra image data是一个挑战。