ViC-MAE

Posted on 2025-06-22, updated on 2025-06-24, filed under Papers

ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders^[1]

The authors are Jefferson Hernandez et al. from Rice University and Google DeepMind. Citation [1]: Hernandez, Jefferson et al. “ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders.” (2023).

Time

2024.Oct

Key Words

MAE
contrastive learning
treat short videos as temporal augmentations

Summary

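The authors propose ViC-MAE, a model that combines MAE with contrastive learning. ViC-MAE is trained on a global representation obtained by pooling the local features learned under the MAE reconstruction loss, and this representation is then used in a contrastive objective across images and video frames. The authors show that the visual representations learned by ViC-MAE generalize well to both video and image classification tasks, and that ViC-MAE achieves state-of-the-art transfer learning performance compared with the recently proposed OmniMAE.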

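Recent advances in self-supervised (SSL) visual representation learning have improved performance on image and video benchmarks. This success has been driven mainly by two approaches. Joint-embedding methods encourage invariance to specific transformations and are either contrastive or negative-free. Masked image modeling (MIM) randomly masks out parts of the input and uses a reconstruction loss to make the model predict the masked parts. Both ideas have been successful on images and on video.

Self-supervised methods for video representation have achieved considerable success, yielding powerful features that perform well on many downstream tasks, and using image-based models to enhance video feature representations is widely applied. In contrast, video-to-image transfer learning has been less successful. This imbalance highlights the subtle and complex challenges of multimodal learning: it is unclear how to integrate the different modalities, and attempts to combine them have led to performance drops, requiring architectural modifications or the conversion of one modality into the other.

Learning from video should yield good image representations, because videos naturally contain complex changes in pose, viewpoint, and so on. These variations cannot be simulated by the standard image augmentations used in joint-embedding or MIM methods. In this work, the authors propose the Visual Contrastive Masked AutoEncoder (ViC-MAE), which uses SSL to learn from both images and videos by treating short videos as different views of the same representation. In the authors' experiments, the model improves video-to-image transfer performance.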

Previous work has successfully applied SSL to video or images using either contrastive learning or masked image modeling. ViC-MAE tries to combine the strengths of both. It samples frames at short intervals and treats them as an additional form of temporal data augmentation. Contrastive learning aligns the representations of time-shifted frames and augmented views, while MIM on single video frames or images learns local features. Unlike previous methods, which use only a single [CLS] token as the global feature, ViC-MAE aggregates the local features with a global pooling layer and then applies a contrastive loss to further strengthen the representation. The whole design is built on the ViT architecture.
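To make the pooling step concrete, here is a minimal sketch of turning patch-level (local) token features into one global feature. It assumes mean pooling, which is only one possible choice of global pooling layer; the function name `mean_pool` and the toy token values are illustrative and do not come from the paper.

```python
def mean_pool(tokens):
    """Average a list of equal-length token vectors into one global vector.

    Assumption: mean pooling stands in for the "global pooling layer";
    the post does not say which pooling operator is actually used.
    """
    if not tokens:
        raise ValueError("need at least one token")
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


# Three toy patch tokens with two feature dimensions each.
patch_tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
global_feature = mean_pool(patch_tokens)
print(global_feature)  # [2.0, 2.0]
```

Compared with reading out a single [CLS] token, every visible patch contributes to the global feature here, so the contrastive loss applied to it also shapes the local features.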

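The closest work is the recently proposed OmniMAE, which aims to be an SSL foundation for image and video downstream tasks. OmniMAE relies on MIM and treats images as videos, whereas ViC-MAE samples frames sparsely and treats frames from a short time span as views of the same content. As a result, ViC-MAE reduces training time, needs fewer resources, and is more efficient.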
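The sparse sampling described above can be sketched as picking two frame indices separated by a bounded gap. This is only an illustration: `sample_frame_pair`, `min_gap`, and `max_gap` are hypothetical names, and the gap values in the usage example are arbitrary rather than the paper's settings.

```python
import random


def sample_frame_pair(num_frames, min_gap, max_gap, rng):
    """Return frame indices (i, j) with min_gap <= j - i <= max_gap.

    The two frames act as a positive pair, i.e. two temporal "views"
    of the same video.
    """
    if min_gap < 1 or max_gap < min_gap:
        raise ValueError("need 1 <= min_gap <= max_gap")
    if num_frames <= min_gap:
        raise ValueError("video is too short for the requested gap")
    # Clamp the gap so the second frame stays inside the video.
    gap = rng.randint(min_gap, min(max_gap, num_frames - 1))
    i = rng.randrange(num_frames - gap)
    return i, i + gap


rng = random.Random(0)  # seeded only to make the sketch reproducible
i, j = sample_frame_pair(num_frames=16, min_gap=4, max_gap=8, rng=rng)
print(i, j)
```

Only two frames per clip are processed, which is where the savings over feeding whole clips through the encoder come from.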

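The authors' main findings are:
- Treating short videos as augmented views and then fine-tuning on regular videos or images gives stronger performance than treating images as videos, while the model retains temporal representations.
- Using large frame gaps between sampled frames improves classification performance.
- Including negative pairs during training outperforms negative-free sample training, in line with other methods that have succeeded at video-to-image transfer.
- Strong image transformations as augmentations are necessary for good performance on images.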
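
Related work:
- Self-supervised video learning: SSL exploits the temporal information in videos to learn representations. By designing pretext tasks around intrinsic video properties such as frame continuity, these methods surpass representations learned from static images. Contrastive methods handle video learning by using temporality to discriminate between training instances. More recently, MIM has been used for video pre-training. The authors' method integrates contrastive learning and MIM into a single pre-training framework for both image and video downstream applications.
- Learning video-to-image representations: Some earlier models trained only on images showed good image-to-video adaptation. However, static images lack the dynamism of video, missing motion cues and camera view changes, which weakens image-based models for video applications. Recent work uses video data to learn robust image representations to mitigate this. For example, VINCE shows that natural augmentations in videos can outperform synthetic augmentations. VFS uses temporal relationships to improve performance on static image tasks. CRW uses cycle consistency for inter-video image mapping, which lets it learn correspondences between frames. ST-MAE shows that video-oriented MIM can benefit image-centric tasks. VITO develops a method for video dataset curation to narrow the domain gap between video and images.
- Learning general representations from video and images: The recent TubeViT uses sparse video tubes to create visual tokens from both images and video. OMNIVORE uses one general encoder for multiple modalities, with specific heads for each task. BEVT adopts a BERT-style approach for video. OmniMAE proposes a masked autoencoder for joint training on video and images.
- Combining contrastive methods with MIM: MSN combines masking and augmentations for efficient contrastive learning, using entropy maximization instead of pixel reconstruction to avoid representational collapse.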
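
ViC-MAE: contrastive learning at the temporal level, masked image modeling at the image level.
- Given a video with T frames, the authors sample two frames \(I_i, I_j\) as a positive input pair; single images that appear in the batch are augmented instead. Note that the batches fed to the model contain both video frames and static images. After the input image tokenizer layer, they obtain a set of patch-level token representations \(X_i, X_j\) for each frame. A token mask is then generated and applied to the corresponding input frame, and the visible tokens are passed to the ViT encoder, which computes their representations separately. For contrastive pre-training, the authors use a separate prediction branch that applies global pooling to the output representations of the main branch and of a siamese copy of the network. This step simplifies the method and avoids an additional loss. The global representations are then passed to a predictor encoder and a target encoder to obtain the frame representations.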
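The token-masking step can be illustrated with a small sketch that splits patch-token indices into a visible set (sent to the encoder) and a masked set (to be reconstructed). The helper name `random_mask`, the 196-token grid, and the 0.75 mask ratio are assumptions for illustration; 0.75 is the value commonly used with MAE, and the post does not state the ratio used here.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices into (visible, masked) index lists.

    Only the visible tokens would be fed to the ViT encoder; the masked
    ones are what the reconstruction loss asks the decoder to predict.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_keep = max(1, round(num_tokens * (1 - mask_ratio)))
    order = list(range(num_tokens))
    rng.shuffle(order)  # uniform random permutation of token positions
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


rng = random.Random(0)
# 196 tokens corresponds to a 14 x 14 patch grid (assumed for illustration).
visible, masked = random_mask(num_tokens=196, mask_ratio=0.75, rng=rng)
print(len(visible), len(masked))  # 49 147
```

Each frame of the positive pair would get its own independent mask, so the two views differ both in time and in which patches the encoder sees.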

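ViC-MAE \(Fig.1^{[1]}\): ViC-MAE takes as input two frames from a video, or two different views of an image in the same batch, and randomly masks them. Global pooling over the local features yields a global representation, which is then passed through a standard predictor and a target encoder and trained with a contrastive loss. An aggregation layer placed before the predictor network prevents collapse of the learned global representation.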
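As a rough illustration of a contrastive loss with negative pairs, the sketch below computes an InfoNCE-style loss over pooled global features. It is an assumption that the loss takes this form: the post only says a contrastive loss with negatives is used, and the function names, temperature value, and toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE-style loss: queries[k] and keys[k] form the positive pair,
    and every other key in the batch serves as a negative."""
    total = 0.0
    for k, q in enumerate(queries):
        logits = [cosine(q, key) / temperature for key in keys]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[k]
    return total / len(queries)


# Toy global features for two samples (e.g. pooled frame representations).
predictor_out = [[1.0, 0.0], [0.0, 1.0]]
aligned_targets = [[1.0, 0.0], [0.0, 1.0]]
swapped_targets = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(predictor_out, aligned_targets))  # small: positives match
print(info_nce(predictor_out, swapped_targets))  # large: positives mismatch
```

In a full training setup this term would be added to the MAE reconstruction loss; how the two terms are weighted is not covered in this post.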
