MAE as Spatiotemporal Learner

Masked Autoencoders As Spatiotemporal Learners[1]

The authors are Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, and Kaiming He from FAIR. Reference [1]: Feichtenhofer, Christoph, Haoqi Fan, Yanghao Li, and Kaiming He. "Masked Autoencoders As Spatiotemporal Learners." arXiv abs/2205.09113 (2022).

Key Words

  • extension of MAE to video

  • minimal domain knowledge

Summary

  1. Minimal domain knowledge: the only spacetime-specific inductive bias is in embedding the patches and their positions.

  2. A working hypothesis: the optimal masking ratio in MAE correlates with the information redundancy of the data. Masking ratio for video: 90%; for images: 75%; for BERT (language): 15%.

  3. Data loading becomes a new bottleneck: because 90% of the patches are dropped, the encoder's compute shrinks dramatically, and video decoding/I/O starts to dominate training time.

  4. Architecture: vanilla ViT. The patch size is \(2 \times 16 \times 16\) (\(T \times H \times W\)), i.e., each patch spans 2 frames in time; a \(16 \times 224 \times 224\) input therefore yields \(8 \times 14 \times 14\) tokens.

  5. Patch embedding: the video clip is divided into regular, non-overlapping patches in spacetime; the patches are flattened and embedded by a linear projection, and position embeddings are added to the embedded patches. This patch-and-position-embedding step is the only spacetime-aware part of the pipeline.

  6. Separable positional embeddings for the encoder: two positional embeddings, one for space and one for time; the spacetime positional embedding is their sum (see the sketch after this list).
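
Putting items 4-6 together (plus the 90% masking from item 2), below is a minimal PyTorch sketch of the only spacetime-aware pieces: tubelet patch embedding, separable space/time positional embeddings, and structure-agnostic random masking. All module and argument names here are my own placeholders, not the authors' code.

```python
import torch
import torch.nn as nn

class VideoPatchEmbed(nn.Module):
    """Split a clip into non-overlapping 2x16x16 (T x H x W) patches
    and linearly project each one. A Conv3d with kernel == stride is
    equivalent to "flatten each patch, then apply a linear projection"."""
    def __init__(self, t=2, p=16, in_chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv3d(in_chans, dim,
                              kernel_size=(t, p, p), stride=(t, p, p))

    def forward(self, x):                    # x: (B, 3, 16, 224, 224)
        x = self.proj(x)                     # (B, dim, 8, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 8*14*14 = 1568, dim)

def separable_pos_embed(nt, nh, nw, dim):
    """Separable space/time positional embeddings: one vector per time
    index, one per spatial location; the spacetime embedding is their
    sum, broadcast over the nt x (nh*nw) token grid. (Learned parameters
    in the paper; random placeholders here.)"""
    pos_t = torch.randn(nt, 1, dim)
    pos_s = torch.randn(1, nh * nw, dim)
    return (pos_t + pos_s).reshape(nt * nh * nw, dim)

def random_spacetime_mask(n_tokens, mask_ratio=0.9):
    """Structure-agnostic random masking: keep a random ~10% of tokens;
    the encoder never sees the rest. E.g. 1568 tokens -> 156 kept."""
    n_keep = int(n_tokens * (1 - mask_ratio))
    ids = torch.randperm(n_tokens)
    return ids[:n_keep].sort().values
```

With these pieces, a \(16 \times 224 \times 224\) clip becomes 1568 tokens, of which the encoder sees only about 156; this is where the compute saving behind item 3 comes from.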

Experiments

  1. Do MAE self-supervised pre-training, then fine-tune the encoder with supervision for evaluation.

  2. Use fewer inductive biases and learn more from the data, which is the pursuit of self-supervised learning.

One line that is really well written: "We hope this exploration will shed light on the future study."

Structure

Figure 1 [1]: Masked Autoencoders as spatiotemporal learners. We mask a large subset (e.g., 90%) of random patches in spacetime. An encoder operates on the set of visible patches. A small decoder then processes the full set of encoded patches and mask tokens to reconstruct the input. Except for patch and positional embeddings, neither the encoder, the decoder, nor the masking strategy has any spatiotemporal inductive bias.
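
As a companion to the caption, here is a hedged sketch of that asymmetric design: the encoder runs only on the visible ~10% of tokens, and a small decoder then processes the full token set, with a shared learned mask token filling the missing positions. `encoder` and `decoder` stand in for plain ViT blocks; all names are illustrative, not the authors' API.

```python
import torch

def mae_forward(encoder, decoder, tokens, keep_ids, pos, mask_token):
    """One reconstruction pass in the spirit of Figure 1.
    tokens:     (B, N, D) embedded patches for the full clip
    keep_ids:   indices of the ~10% visible tokens
    pos:        (N, D) positional embeddings
    mask_token: (1, 1, D) shared learned embedding for masked slots"""
    B, N, D = tokens.shape
    # The encoder operates on the set of visible patches only.
    visible = tokens[:, keep_ids] + pos[keep_ids]
    latent = encoder(visible)                 # (B, ~0.1*N, D)
    # Decoder input: encoded patches scattered back to their positions,
    # mask tokens filling the other 90%. (The paper uses a separate,
    # narrower decoder pos-embed; reused here for brevity.)
    full = mask_token.expand(B, N, D).clone()
    full[:, keep_ids] = latent
    pred = decoder(full + pos)                # (B, N, D)
    return pred
```

In the actual method the decoder output is further projected to pixel patches for the reconstruction loss; this sketch stops at the token level.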