DropMAE

Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks [1]

The authors are Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, and Antoni B. Chan, from CityU, IDEA, Tencent AI Lab, and CUHK (Shenzhen). Reference [1]: Wu, Qiangqiang, et al. "DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks." 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 14561-14571.

Time

  • 2023.Apr

Key Words

  • masked autoencoder
  • temporal matching-based
  • spatial-attention dropout

Motivation

  1. The goal is to apply MAE pre-training to downstream tasks such as visual object tracking (VOT) and video object segmentation (VOS). A straightforward extension of MAE is to mask out frame patches in videos and reconstruct the frame pixels. However, the authors find that such frame reconstruction relies heavily on spatial cues and largely ignores temporal relations, which leads to sub-optimal temporal matching representations for VOT and VOS.

Summary

  1. DropMAE adaptively applies spatial-attention dropout during frame reconstruction to promote temporal correspondence learning in videos. The authors show that DropMAE is a strong and efficient temporal matching learner, achieving better fine-tuning results on matching-based tasks. They also find that motion diversity in the pre-training videos matters more than scene diversity. The pre-trained DropMAE can be plugged directly into existing ViT-based trackers for fine-tuning without further modifications.

  2. Two recent VOT works, SimTrack and OSTrack, both use an MAE pre-trained ViT model as the tracking backbone and achieve strong results without complex tracking pipelines. However, their MAE is pre-trained on ImageNet, so no temporal prior can be learned from static images. Previous tracking methods show that temporal correspondence learning is key to developing a robust tracker, which motivates developing an MAE framework specifically for matching-based video tasks.

  3. A natural way to extend MAE to videos is to randomly mask out frame patches in a video clip and then reconstruct the clip. With such a baseline, TwinMAE, reconstructing a masked patch relies heavily on spatially neighbouring patches within the same frame, which implies a heavy co-adaptation of spatial cues (within-frame tokens) during reconstruction and may cause the learning of sub-optimal temporal representations for matching-based downstream tasks (see the sketch below).
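The following is a minimal sketch of this two-frame TwinMAE-style baseline in PyTorch. The class and method names (`TwinMAE`, `patchify`) and the abstract `encoder`/`decoder` modules are illustrative assumptions; only the mask-and-reconstruct flow over a two-frame token sequence is the point here.

```python
import torch
import torch.nn as nn

class TwinMAE(nn.Module):
    """Sketch of a two-frame masked autoencoder (names are illustrative)."""
    def __init__(self, encoder, decoder, patch=16, mask_ratio=0.75):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder  # ViT encoder / light decoder
        self.patch, self.mask_ratio = patch, mask_ratio

    def patchify(self, img):
        # (B, 3, H, W) -> (B, N, patch*patch*3) flattened patch tokens
        B, C, H, W = img.shape
        p = self.patch
        x = img.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

    def forward(self, frame1, frame2):
        # Serialize both frames into one token sequence (the two-frame input).
        tokens = torch.cat([self.patchify(frame1), self.patchify(frame2)], dim=1)
        B, N, D = tokens.shape

        # Randomly mask a large fraction of tokens across both frames.
        keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        visible_idx, masked_idx = idx[:, :keep], idx[:, keep:]
        visible = torch.gather(tokens, 1, visible_idx.unsqueeze(-1).expand(-1, -1, D))

        # Encode visible tokens, decode, and compute the MAE-style pixel loss
        # on the masked locations only.
        latent = self.encoder(visible)
        pred = self.decoder(latent, masked_idx)          # predicts masked patches
        target = torch.gather(tokens, 1, masked_idx.unsqueeze(-1).expand(-1, -1, D))
        return ((pred - target) ** 2).mean()
```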

  4. To address this, DropMAE adaptively performs spatial-attention dropout to break up the co-adaptation between spatial cues (within-frame tokens) during frame reconstruction, thereby encouraging more temporal interaction and facilitating temporal correspondence learning (a diagnostic sketch follows).
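The co-adaptation claim can be made concrete by measuring how much attention mass a query token places on keys from its own frame versus the other frame. Below is a small illustrative helper under the assumption that the softmax attention matrix of a two-frame sequence (with `n` tokens per frame) is available; it is a diagnostic sketch, not part of the method itself.

```python
import torch

def within_frame_attention_ratio(attn: torch.Tensor, n: int) -> float:
    """attn: (B, heads, 2n, 2n) softmax attention over a two-frame token sequence,
    where tokens [0, n) belong to frame 1 and [n, 2n) to frame 2.
    Returns the average attention mass each query spends on its own frame."""
    same = torch.zeros_like(attn)
    same[..., :n, :n] = 1.0   # frame-1 queries attending to frame-1 keys
    same[..., n:, n:] = 1.0   # frame-2 queries attending to frame-2 keys
    return (attn * same).sum(-1).mean().item()
```

In a TwinMAE-style baseline this ratio is high, which is exactly the behavior ASAD is designed to reduce.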

  5. VOT: the target is given as a bounding box in the first frame, and the tracker predicts the target bounding box in subsequent frames. VOS: a binary mask of the target is given in the first frame, and the model predicts target masks afterwards. Among earlier VOT methods, correlation filter-based approaches dominated because they model target appearance variation well. With the development of deep learning, Siamese networks were introduced to VOT; SiamFC takes the template and search images as input for target localization.

  6. Among self-supervised methods, many hand-designed pretext tasks have been used for pre-training, e.g., image colorization, jigsaw puzzle solving, future frame prediction, rotation prediction, and contrastive learning approaches. However, these methods are sensitive to the type and strength of the applied data augmentation, which makes them hard to train.

  7. Unlike the existing VideoMAE, which targets video action recognition and uses a long video clip (16 frames) during pre-training, DropMAE samples 2 frames from a video as the input to TwinMAE for pre-training, to stay consistent with the two-frame setting commonly used in VOT/VOS (see the sampling sketch below).
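A minimal sketch of such two-frame sampling is given below. The bounded temporal gap `max_gap` and its value are illustrative assumptions, not numbers reported in the paper; the intent is only to show that a frame pair (rather than a long clip) is drawn from each video.

```python
import random

def sample_frame_pair(num_frames: int, max_gap: int = 50) -> tuple:
    """Sample two frame indices from one video for two-frame pre-training.
    The pair is drawn within a bounded temporal gap so that both frames still
    contain the same objects under some motion."""
    i = random.randrange(num_frames)
    lo = max(0, i - max_gap)
    hi = min(num_frames - 1, i + max_gap)
    j = random.randint(lo, hi)
    return i, j
```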

  8. Given a query token, the basic idea is to adaptively drop a portion of its within-frame cues in order to push the model to learn more reliable temporal correspondences. In other words, the interaction between a query token and tokens within the same frame is restricted, encouraging more interactions with tokens in the other frame, so the model learns a better temporal matching ability (a simplified sketch follows).
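The snippet below is a simplified sketch of this kind of spatial-attention dropout inside one self-attention layer: for every query it suppresses a fixed fraction of its strongest within-frame attention logits so that, after softmax, more attention mass is redistributed to the other frame. The paper's actual ASAD criterion is adaptive and more involved, so treat this only as an illustration of the mechanism, with hypothetical names.

```python
import torch

def spatial_attention_dropout(logits: torch.Tensor, n: int,
                              drop_ratio: float = 0.1) -> torch.Tensor:
    """logits: (B, heads, 2n, 2n) pre-softmax attention for a two-frame sequence
    (tokens [0, n) are frame 1, [n, 2n) are frame 2).
    For every query, mask out a fraction of its within-frame keys so that more
    attention goes to the other frame after softmax."""
    B, H, N, _ = logits.shape
    same = torch.zeros(N, N, dtype=torch.bool, device=logits.device)
    same[:n, :n] = True
    same[n:, n:] = True                                   # within-frame positions

    k = max(1, int(n * drop_ratio))                       # keys to drop per query
    within = logits.masked_fill(~same, float('-inf'))     # keep within-frame keys only
    _, drop_idx = within.topk(k, dim=-1)                  # strongest within-frame cues
    drop_mask = torch.zeros_like(logits, dtype=torch.bool).scatter_(-1, drop_idx, True)

    return logits.masked_fill(drop_mask, float('-inf'))   # dropped keys get zero attention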

  9. Pipeline for applying DropMAE to VOT:

    • The cropped template and search images are first serialized into token sequences and concatenated; the combined sequence is then added with positional embeddings and fed into the ViT backbone for joint feature extraction and interaction. Finally, the updated search features are fed into a prediction head to predict the target bounding box.
    • During fine-tuning, the pre-trained DropMAE is used to initialize the ViT in OSTrack, and two frame identity embeddings are added to the template and search embeddings respectively, to stay consistent with the pre-training stage (a sketch of this forward pass follows the list).
Fig. 3 [1]: DropMAE: the proposed adaptive spatial-attention dropout (ASAD) facilitates temporal correspondence learning for temporal matching tasks. TwinMAE follows the same pipeline except that the ASAD module is not used.
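A minimal sketch of this one-stream fine-tuning forward pass is shown below, assuming a generic ViT backbone initialized from DropMAE. The module names (`patch_embed`, `pos_embed_z/x`, `id_embed`, `box_head`), the crop sizes implied by the positional embeddings, and the toy box head are illustrative assumptions rather than the exact OSTrack implementation.

```python
import torch
import torch.nn as nn

class OneStreamTracker(nn.Module):
    """Sketch of the template/search joint forward pass (module names illustrative)."""
    def __init__(self, vit_blocks, embed_dim=768, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        # Assumes 128x128 template and 256x256 search crops with 16x16 patches.
        self.pos_embed_z = nn.Parameter(torch.zeros(1, 64, embed_dim))    # template positions
        self.pos_embed_x = nn.Parameter(torch.zeros(1, 256, embed_dim))   # search positions
        self.id_embed = nn.Parameter(torch.zeros(2, 1, 1, embed_dim))     # frame identity embeddings
        self.blocks = vit_blocks                                          # ViT blocks (DropMAE-initialized)
        self.box_head = nn.Linear(embed_dim, 4)                           # toy box head for illustration

    def tokenize(self, img):
        return self.patch_embed(img).flatten(2).transpose(1, 2)           # (B, N, D)

    def forward(self, template, search):
        # Serialize both crops, add positional + frame identity embeddings, concatenate.
        z = self.tokenize(template) + self.pos_embed_z + self.id_embed[0]
        x = self.tokenize(search) + self.pos_embed_x + self.id_embed[1]
        tokens = torch.cat([z, x], dim=1)

        # Joint feature extraction and template-search interaction in the ViT.
        for blk in self.blocks:
            tokens = blk(tokens)

        # Only the updated search tokens go to the prediction head.
        search_tokens = tokens[:, z.shape[1]:]
        return self.box_head(search_tokens.mean(dim=1))                   # (B, 4) box
```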