Swin Transformer

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows[1]

The authors are Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo from MSRA. Reference [1]: Liu, Ze et al. "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 9992-10002.

Time

  • 2021.Mar

Key Words

  • Shifted windows
  • non-overlapping local windows
  • hierarchical feature maps
  • linear computational complexity to image size
  • much lower latency

Motivation

  1. The paper adapts the Transformer from language to vision. Differences between the two domains raise challenges, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, the authors propose a hierarchical Transformer whose representation is computed with shifted windows.

Summary

  1. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing cross-window connections. The hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities make the Swin Transformer compatible with a broad range of vision tasks, and it achieves strong performance on several benchmarks.

  2. Unlike the word tokens that serve as the basic elements of processing in language Transformers, visual elements vary substantially in scale. In existing Transformer-based models, tokens are all of a fixed scale, which is unsuitable for vision applications.

  3. The Swin Transformer constructs a hierarchical representation by starting from small-sized patches and gradually merging neighboring patches in deeper Transformer layers. The shifted windows bridge the windows of the preceding layer, providing connections among them. All query patches within a window share the same key set.

  4. The standard ViT architecture and its adaptations for image classification all perform global self-attention, whose computational complexity is quadratic in the number of tokens. This makes them unsuitable for vision problems that require dense tokens for dense prediction or for representing high-resolution images.

  5. One issue with shifted window partitioning is that it produces more windows, some of them smaller than \(M \times M\). A naive solution is to pad the smaller windows to \(M \times M\) and mask out the padded values when computing attention. When the number of windows in the regular partitioning is small, the added computation of this naive approach is considerable. The paper therefore proposes a more efficient batch computation approach: cyclic-shifting the feature map toward the top-left direction; see the sketch after this list.
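Below is a minimal sketch of this cyclic-shift idea, assuming PyTorch and a feature map of shape (B, H, W, C); the helper names and toy sizes are my own, not the paper's official implementation, and the attention mask that confines attention to originally-adjacent sub-windows is omitted.

```python
# Minimal sketch of the cyclic-shift trick, assuming a (B, H, W, C) feature
# map and window size M; names and sizes are illustrative.
import torch

def cyclic_shift(x, shift_size):
    """Roll the feature map toward the top-left so the shifted window
    partition can reuse the same regular, non-padded window layout."""
    # Negative shifts move content toward the top-left; elements that fall
    # off one edge re-enter on the opposite edge.
    return torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

def reverse_cyclic_shift(x, shift_size):
    """Undo the roll after window attention has been computed."""
    return torch.roll(x, shifts=(shift_size, shift_size), dims=(1, 2))

# Example with H = W = 8, M = 4, shift = M // 2 = 2.
x = torch.arange(64, dtype=torch.float32).reshape(1, 8, 8, 1)
shifted = cyclic_shift(x, shift_size=2)
# After the roll, a single M x M window may contain pixels that belonged to
# different windows before the shift, so an attention mask is still required
# to keep them from attending to each other; that mask is omitted here.
restored = reverse_cyclic_shift(shifted, shift_size=2)
assert torch.equal(x, restored)
```

Because the rolled map is partitioned with the same regular routine, the number of batched windows stays equal to that of the regular partition and no padding is needed.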

Framework \(Fig. 1^{[1]}\). (a) The proposed Swin Transformer builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red). It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. (b) In contrast, previous vision Transformers [20] produce feature maps of a single low resolution and have quadratic computation complexity to input image size due to computation of self-attention globally.
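For reference, the complexity comparison behind this linear-vs-quadratic claim, following the formulation in [1] for a feature map of \(h \times w\) patches with channel dimension \(C\) and window size \(M\):

\[
\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C, \qquad
\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC
\]

Global MSA is quadratic in the number of patches \(hw\), while window-based W-MSA is linear in \(hw\) when \(M\) is fixed (the paper uses \(M = 7\) by default).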

shifted window \(Fig.2^{[1]}\). An illustration of the shifted window approach for computing self-attention in the proposed Swin Transformer architecture. In layer l (left), a regular window partitioning scheme is adopted, and self-attention is computed within each window. In the next layer l + 1 (right), the window partitioning is shifted, resulting in new windows. The self-attention computation in the new windows crosses the boundaries of the previous windows in layer l, providing connections among them.
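As a minimal sketch (assuming PyTorch, a (B, H, W, C) feature map with H and W divisible by the window size; the helper name window_partition is illustrative, not the paper's code), the regular partitioning of layer l can be written as:

```python
# Minimal sketch of regular window partitioning for layer l, assuming a
# (B, H, W, C) feature map whose H and W are divisible by window_size.
import torch

def window_partition(x, window_size):
    """Split (B, H, W, C) into non-overlapping windows of shape
    (num_windows * B, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size, window_size, C)

x = torch.randn(2, 8, 8, 96)                 # B=2, H=W=8, C=96
windows = window_partition(x, window_size=4)
print(windows.shape)                         # torch.Size([8, 4, 4, 96]): 4 windows per image
# Layer l+1 applies the same routine after rolling the feature map by
# (window_size // 2, window_size // 2), which makes the new windows straddle
# the boundaries of the layer-l windows.
```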

Architecture \(Fig.3^{[1]}\). (a) The architecture of a Swin Transformer (Swin-T); (b) two successive Swin Transformer Blocks (notation presented with Eq. (3)). W-MSA and SW-MSA are multi-head self attention modules with regular and shifted windowing configurations, respectively.
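The two consecutive blocks referenced as Eq. (3) compute, as given in [1], where \(\hat{\mathbf{z}}^{l}\) and \(\mathbf{z}^{l}\) denote the outputs of the (S)W-MSA module and the MLP module of block \(l\), respectively:

\[
\begin{aligned}
\hat{\mathbf{z}}^{l} &= \text{W-MSA}\big(\text{LN}(\mathbf{z}^{l-1})\big) + \mathbf{z}^{l-1},\\
\mathbf{z}^{l} &= \text{MLP}\big(\text{LN}(\hat{\mathbf{z}}^{l})\big) + \hat{\mathbf{z}}^{l},\\
\hat{\mathbf{z}}^{l+1} &= \text{SW-MSA}\big(\text{LN}(\mathbf{z}^{l})\big) + \mathbf{z}^{l},\\
\mathbf{z}^{l+1} &= \text{MLP}\big(\text{LN}(\hat{\mathbf{z}}^{l+1})\big) + \hat{\mathbf{z}}^{l+1}
\end{aligned}
\]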