MViTv2

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection[1]

As with MViT, the authors are Yanghao Li, Chao-Yuan Wu, et al. from FAIR and UC Berkeley. Reference [1]: Li, Yanghao, et al. "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection." 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 4794-4804.

Time

  • 2021.Dec

Key Words

  • MViT that incorporates decomposed relative positional embeddings and residual pooling connections.

Summary

  1. Applying ViT to high-resolution object detection and space-time video understanding tasks remains challenging: dense visual signals impose heavy computation and memory requirements, because the self-attention blocks of Transformer-based models scale quadratically in complexity with sequence length. Two strategies have been used to address this:
    • local attention computation within a window, for object detection;
    • pooling attention that locally aggregates features before computing self-attention, for video tasks. MViT adopts the latter. Unlike ViT, which keeps a single fixed resolution throughout the network, MViT builds a feature hierarchy with multiple stages, going from high resolution to low resolution.
  2. MViTv2 improves on MViT in two main respects:
    • It upgrades pooling attention along two axes:
      • shift-invariant positional embeddings that use decomposed location distances to inject position information into Transformer blocks;
      • a residual pooling connection that compensates for the effect of pooling strides in the attention computation.
    • It adopts a standard dense-prediction framework, Mask R-CNN with FPN, on top of the improved MViT structure, and applies it to object detection and instance segmentation. The authors study whether MViT can rely on pooling attention to overcome the computation and memory cost of high-resolution visual input. Experiments show that pooling attention is more effective than local window attention mechanisms; the paper further develops a simple-yet-effective hybrid window attention that complements pooling attention for a better accuracy/compute tradeoff.
  3. Improved Multiscale Vision Transformers.
    • Improved Pooling Attention

      • Decomposed Relative Positional Embeddings: MViT models interactions between tokens through their content rather than their structure; space-time structure is modeled only via absolute positional embeddings that supply location information. This ignores the fundamental principle of shift-invariance in vision: the way MViT models the interaction between two patches changes with their absolute positions, even when their relative positions stay the same. To address this, relative positional embeddings, which depend only on the relative location distance between tokens, are incorporated into the pooled self-attention computation. The relative position between two input elements i and j is encoded into a positional embedding \(R_{p(i),p(j)}\in\mathbb{R}^d\), where \(p(i)\) and \(p(j)\) denote the spatial positions of elements i and j. This pairwise encoding is then embedded into the self-attention module: \[\mathrm{Attn}(Q,K,V)=\mathrm{Softmax}\left((QK^{\top}+E^{(\mathrm{rel})})/\sqrt{d}\right)V,\\ \mathrm{where}\quad E_{ij}^{(\mathrm{rel})}=Q_{i}\cdot R_{p(i),p(j)}. \tag{3}\]

      However, the number of possible embeddings \(R_{p(i),p(j)}\) scales in \(O(TWH)\), which is expensive to compute. To reduce this complexity, the distance computation between elements i and j is decomposed along the spatiotemporal axes: \[R_{p(i),p(j)}=R_{h(i),h(j)}^\mathrm{h}+R_{w(i),w(j)}^\mathrm{w}+R_{t(i),t(j)}^\mathrm{t},\]

      where \(R^h, R^w, R^t\) are the positional embeddings along the height, width, and temporal axes, and \(h(i), w(i), t(i)\) denote the vertical, horizontal, and temporal position of token i. Note that \(R^t\) is optional and only required in the video case to support the temporal dimension. In comparison, the decomposed embeddings reduce the number of learned embeddings to \(O(T+W+H)\), which matters most for early-stage, high-resolution feature maps.
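      To make the decomposition concrete, below is a minimal PyTorch sketch of the 2D (image) case, computing \(E^{(\mathrm{rel})}\) from per-axis embedding tables. The variable names (rel_h, rel_w) and the random initialization are illustrative assumptions, not the paper's exact implementation.

```python
import torch

H, W, d = 4, 4, 8                    # feature-map height/width, per-head dim
q = torch.randn(H * W, d)            # queries, one token per spatial location

# One learned embedding per axis distance in [-(H-1), H-1] / [-(W-1), W-1]:
# O(H + W) table entries instead of O(H * W) pairwise embeddings R_{p(i),p(j)}.
rel_h = torch.randn(2 * H - 1, d)
rel_w = torch.randn(2 * W - 1, d)

# Axis coordinates h(i), w(i) of every token i (row-major flattening).
hs = torch.arange(H).repeat_interleave(W)
ws = torch.arange(W).repeat(H)

# Look up R^h_{h(i),h(j)} and R^w_{w(i),w(j)} by shifted axis distance.
Rh = rel_h[hs[:, None] - hs[None, :] + (H - 1)]   # (HW, HW, d)
Rw = rel_w[ws[:, None] - ws[None, :] + (W - 1)]   # (HW, HW, d)

# E^(rel)_ij = Q_i · R_{p(i),p(j)}, with R decomposed over the two axes.
E_rel = torch.einsum('id,ijd->ij', q, Rh + Rw)    # added to QK^T before softmax
```

      In the video case, a third table \(R^t\) of size \(2T-1\) would be added the same way.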

      • Residual pooling connection: Pooling attention is very effective at reducing the computational complexity and memory requirements of attention blocks. MViTv1 uses larger strides on the \(K\) and \(V\) tensors than on the \(Q\) tensor, which is only downsampled when the resolution of the output sequence changes across stages. This motivates adding a residual pooling connection with the (pooled) \(Q\) tensor to increase information flow and ease the training of pooling attention blocks. Concretely, a new residual pooling connection inside the attention block adds the pooled query tensor to the output sequence \(Z\): \(Z := \mathrm{Attn}(Q,K,V)+Q\).
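      A minimal sketch of a pooling attention block with this residual connection follows, assuming single-head attention and max pooling for the pooling operators (the paper's actual blocks use multi-head attention with learned pooling); the module and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class ResidualPoolingAttention(nn.Module):
    """Pooling attention with Z = Attn(Q, K, V) + Q (queries pre-pooled)."""
    def __init__(self, dim, q_stride=1, kv_stride=2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        # K/V use larger strides than Q; Q is only downsampled when the
        # stage resolution changes (q_stride > 1).
        self.pool_q = nn.MaxPool2d(q_stride) if q_stride > 1 else nn.Identity()
        self.pool_kv = nn.MaxPool2d(kv_stride)
        self.scale = dim ** -0.5

    def forward(self, x, hw):
        H, W = hw
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def pool(t, op):  # (B, N, C) -> (B, N', C) via a 2D pooling op
            t = op(t.transpose(1, 2).reshape(B, C, H, W))
            return t.flatten(2).transpose(1, 2)

        q = pool(q, self.pool_q)
        k, v = pool(k, self.pool_kv), pool(v, self.pool_kv)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return attn @ v + q  # residual pooling connection

x = torch.randn(2, 14 * 14, 96)
z = ResidualPoolingAttention(96)(x, hw=(14, 14))  # K/V pooled 2x, Q kept
```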
    • MViT for object detection: MViT's hierarchical structure produces multiscale feature maps in four stages, so it integrates naturally with FPN (see the FPN sketch at the end of this list).

      • Hybrid window attention: Self-attention in the Transformer has quadratic complexity w.r.t. the number of tokens, which is especially costly for object detection with its higher-resolution inputs and feature maps. The paper studies two ways to reduce compute and memory complexity: first, pooling attention; second, window attention. Both control the complexity of self-attention by reducing the size of the query, key, and value tensors, but their intrinsic natures differ: pooling attention pools features by downsampling them via local aggregation while keeping a global self-attention computation, whereas window attention keeps the resolution of the tensors but performs self-attention locally, dividing the input (patchified tokens) into non-overlapping windows and computing local self-attention only within each window. This intrinsic difference motivates studying whether the two can complement each other on object detection.

      Because window attention performs only local self-attention within windows, it lacks connections across windows. Unlike Swin, which uses shifted windows to alleviate this problem, the paper proposes a simple hybrid window attention (Hwin) design to add cross-window connections: local window attention is used in all but the last block of each of the last three stages, so the blocks that feed FPN still aggregate information globally.
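      For reference, here is a minimal sketch of the non-overlapping window attention component (learned projections and multi-head details omitted; the helper name window_attention is illustrative):

```python
import torch

def window_attention(x, win):
    """Self-attention restricted to non-overlapping win x win windows.
    x: (B, H, W, C) feature map; win must divide H and W."""
    B, H, W, C = x.shape
    # Partition into (B * num_windows, win*win, C) token groups.
    t = x.view(B, H // win, win, W // win, win, C)
    t = t.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)
    # Plain self-attention, but only inside each window (q = k = v = t here;
    # a real block would apply learned projections first).
    attn = (t @ t.transpose(-2, -1) * C ** -0.5).softmax(dim=-1)
    out = attn @ t
    # Reverse the partition back to (B, H, W, C).
    out = out.view(B, H // win, W // win, win, win, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

y = window_attention(torch.randn(2, 8, 8, 96), win=4)
```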

      • Positional embeddings in detection: Unlike ImageNet image classification, where the input is a fixed resolution, object detection involves inputs of varying sizes. The positional embeddings in MViT (either absolute or relative) are first initialized from the ImageNet-pretrained weights, which correspond to a \(224 \times 224\) input size, and then interpolated to the respective sizes for object detection training.
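      A minimal sketch of this interpolation step for an absolute positional embedding (the 14x14 pretrained grid assumes a 224x224 input with patch size 16; bicubic interpolation is an assumed implementation choice):

```python
import torch
import torch.nn.functional as F

pos = torch.randn(1, 14 * 14, 768)    # pretrained at 224x224 (14x14 tokens)
new_hw = (64, 64)                     # e.g. a 1024x1024 detection input

grid = pos.transpose(1, 2).reshape(1, 768, 14, 14)   # tokens -> 2D grid
grid = F.interpolate(grid, size=new_hw, mode='bicubic', align_corners=False)
pos_new = grid.flatten(2).transpose(1, 2)            # back to (1, 64*64, 768)
```

      Relative embeddings would be handled analogously, by interpolating each per-axis table to the new axis length.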
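      Returning to the FPN integration mentioned above: because each MViT stage emits a feature map at a different scale, the stage outputs can be fed straight into a standard FPN. A minimal sketch using torchvision's FeaturePyramidNetwork, with illustrative per-stage channel dims (96/192/384/768) and spatial sizes:

```python
from collections import OrderedDict
import torch
from torchvision.ops import FeaturePyramidNetwork

# Fake stage outputs at strides 4/8/16/32 (dims are illustrative).
feats = OrderedDict(
    (f'stage{i + 1}', torch.randn(1, c, 56 // 2 ** i, 56 // 2 ** i))
    for i, c in enumerate([96, 192, 384, 768])
)
fpn = FeaturePyramidNetwork(in_channels_list=[96, 192, 384, 768],
                            out_channels=256)
pyramid = fpn(feats)  # one 256-channel map per scale, for the Mask R-CNN heads
```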

MViTv2 \(Fig.1^{[1]}\) MViTv2 is a multiscale transformer with state-of-the-art performance across three visual recognition tasks.

Improved Pooling Attention \(Fig.2^{[1]}\) The improved pooling attention mechanism, which incorporates the decomposed relative position embedding \(R_{p(i),p(j)}\) and a residual pooling connection module in the attention block.

Multiscale MViTv2 \(Fig.3^{[1]}\) The MViT backbone used with FPN for object detection. The multiscale transformer features integrate naturally with a standard FPN.