Video Swin Transformer

Posted on 2024-04-19, updated on 2024-10-18, in Papers

Video Swin Transformer^[1]

The authors are Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu, from MSRA, USTC, HUST, and THU. Citation [1]: Liu, Ze et al. “Video Swin Transformer.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 3192-3201.

Time

2021.Jun

Summary:

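Video models built on Transformers can globally connect patches across the spatial and temporal dimensions. This paper instead advocates an inductive bias of locality in video Transformers, which yields a good speed-accuracy trade-off. The locality is realized by adapting the Swin Transformer. Earlier convolutional backbones for video were mostly taken from image backbones and simply extended along the temporal axis. For example, 3D convolution is a direct extension of 2D convolution for joint spatial and temporal modeling at the operator level.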

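The paper proposes a pure-Transformer backbone for video recognition and finds that it outperforms factorized models. It does so by exploiting the inherent spatiotemporal locality of video: pixels that are closer to each other in spatiotemporal distance are more likely to be correlated. Because of this property, full spatiotemporal self-attention can be well approximated by self-attention computed locally, which saves both computation and model size. The design is a spatiotemporal adaptation of Swin Transformer. Swin Transformer introduced inductive biases for spatial locality, as well as for hierarchy and translation invariance. Video Swin Transformer strictly follows the hierarchical structure of the original Swin Transformer, but extends the scope of local attention computation from the spatial domain to the spatiotemporal domain. Because local attention is computed on non-overlapping windows, the shifted window mechanism of the original Swin Transformer is also reformulated to handle spatiotemporal input.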
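The input video has size $T \times H \times W \times 3$: it consists of $T$ frames, each containing $H \times W \times 3$ pixels. Each 3D patch of size $2 \times 4 \times 4 \times 3$ is treated as one token, so the input yields $\frac{T}{2} \times \frac{H}{4} \times \frac{W}{4}$ 3D tokens, each carrying a 96-dimensional feature ($2 \cdot 4 \cdot 4 \cdot 3 = 96$). A linear embedding layer then projects each token's feature to $C$ dimensions. The Video Swin Transformer block is built mainly around the 3D shifted window. Compared with images, videos produce far more input tokens because of the extra temporal dimension, so a global self-attention module is a poor fit: it would incur enormous computation and memory costs. Following Swin Transformer, a locality inductive bias is therefore introduced into the self-attention module, and this is shown to be effective for video recognition.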
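The token arithmetic above can be checked with a small NumPy sketch. This is an illustration only, not the authors' implementation: the function name `patch_partition_3d`, the clip size, the embedding width `C = 96`, and the random projection matrix are all assumptions made for the example.

```python
import numpy as np

def patch_partition_3d(video, patch=(2, 4, 4)):
    """Split a (T, H, W, 3) video into non-overlapping 3D patches.

    Returns an array of shape (T/2, H/4, W/4, 2*4*4*3) for the default
    patch size, i.e. one 96-dim raw feature vector per 3D token.
    """
    T, H, W, ch = video.shape
    pt, ph, pw = patch
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide evenly"
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, ch)
    # Move the three grid axes to the front and the within-patch axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(T // pt, H // ph, W // pw, pt * ph * pw * ch)

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 32, 32, 3))   # hypothetical small clip: T=8, H=W=32
tokens = patch_partition_3d(video)            # shape (4, 8, 8, 96)

C = 96                                        # hypothetical embedding width
W_embed = rng.standard_normal((tokens.shape[-1], C)) * 0.02  # stand-in for learned weights
embedded = tokens @ W_embed                   # shape (4, 8, 8, C)
```

Each token's 96 values are flattened in (frame, row, column, channel) order here; that ordering is a choice of this sketch.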
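Multi-head self-attention on non-overlapping 3D windows: the MSA used in the 2D case is applied directly to video input. Given a video composed of $T' \times H' \times W'$ 3D tokens and a 3D window size of $P \times M \times M$, the windows partition the video input without overlap. The input tokens are thus divided into $\lceil \frac{T'}{P} \rceil \times \lceil \frac{H'}{M} \rceil \times \lceil \frac{W'}{M} \rceil$ non-overlapping 3D windows. As shown in Fig. 2, for $8 \times 8 \times 8$ input tokens and a window size of $4 \times 4 \times 4$, the number of windows is $2 \times 2 \times 2 = 8$. Multi-head self-attention is then performed within each 3D window.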
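The partition can be sketched in a few lines of NumPy. The helper names `num_windows` and `window_partition_3d` are hypothetical, and the sketch assumes every token axis is divisible by the window size, so no padding is handled.

```python
import math
import numpy as np

def num_windows(tokens_shape, window):
    """Windows per axis: ceil(T'/P), ceil(H'/M), ceil(W'/M)."""
    return tuple(math.ceil(n / w) for n, w in zip(tokens_shape, window))

def window_partition_3d(x, window):
    """Split (T', H', W', C) tokens into (num_windows, P*M*M, C).

    Assumes each token axis is divisible by the window size (no padding).
    """
    T, H, W, C = x.shape
    P, Mh, Mw = window
    x = x.reshape(T // P, P, H // Mh, Mh, W // Mw, Mw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, P * Mh * Mw, C)

x = np.arange(8 * 8 * 8 * 2, dtype=np.float32).reshape(8, 8, 8, 2)
windows = window_partition_3d(x, (4, 4, 4))
print(num_windows((8, 8, 8), (4, 4, 4)))  # (2, 2, 2)
print(windows.shape)                      # (8, 64, 2)
```

Self-attention would then run independently over the 64 tokens of each of the 8 windows.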
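When multi-head self-attention is applied within each non-overlapping 3D window, there are no connections across different windows, which may limit the representation power of the architecture. The shifted 2D window mechanism of Swin Transformer is therefore extended to 3D windows, introducing cross-window connections while keeping the efficient computation of non-overlapping windows. Given $T' \times H' \times W'$ input 3D tokens and a 3D window size of $P \times M \times M$, consider two consecutive layers. The self-attention module in the first layer uses the regular window partitioning strategy, giving $\lceil \frac{T'}{P} \rceil \times \lceil \frac{H'}{M} \rceil \times \lceil \frac{W'}{M} \rceil$ non-overlapping windows. For the self-attention module in the second layer, the window partition is shifted along the temporal, height, and width axes by $(\frac{P}{2}, \frac{M}{2}, \frac{M}{2})$ tokens.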
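One common way to realize a shifted partition without creating extra windows, used for 2D windows in Swin Transformer, is a cyclic shift of the token grid followed by the regular partition. The sketch below shows only that roll step in NumPy; treating it as what the authors' code does is an assumption, the function name `cyclic_shift_3d` is hypothetical, and the attention mask that stops wrapped-around tokens from attending to each other is omitted.

```python
import numpy as np

def cyclic_shift_3d(x, window, reverse=False):
    """Roll (T', H', W', C) tokens by half a window along T', H', and W'.

    reverse=True undoes the shift after windowed attention has been applied.
    """
    P, Mh, Mw = window
    shift = (P // 2, Mh // 2, Mw // 2)
    if not reverse:
        shift = tuple(-s for s in shift)
    return np.roll(x, shift=shift, axis=(0, 1, 2))

x = np.arange(8 * 8 * 8, dtype=np.float32).reshape(8, 8, 8, 1)
shifted = cyclic_shift_3d(x, (4, 4, 4))                        # shift by (2, 2, 2) tokens
restored = cyclic_shift_3d(shifted, (4, 4, 4), reverse=True)   # back to the original grid
```

Because the rolled grid has the same shape as the original, a regular $4 \times 4 \times 4$ partition of it still produces 8 windows.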
3D Relative Position Bias: many previous works have shown that including a relative position bias for each head in the self-attention computation is beneficial. Following them, a 3D relative position bias $B\in\mathbb{R}^{P^{2}\times M^{2}\times M^{2}}$ is introduced for each head as
\[\mathrm{Attention}(Q,K,V)=\mathrm{SoftMax}(QK^T/\sqrt{d}+B)V,\]

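where $Q,K,V\in\mathbb{R}^{PM^{2}\times d}$ are the query, key, and value matrices, $d$ is the dimension of the query and key features, and $PM^2$ is the number of tokens in a 3D window. Because the relative position along each axis lies in the range $[-P+1, P-1]$ (temporal) or $[-M+1, M-1]$ (height or width), the bias is parameterized by a smaller matrix $\hat{B}\in\mathbb{R}^{(2P-1)\times(2M-1)\times(2M-1)}$, and the values in $B$ are taken from $\hat{B}$.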
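To make the lookup from $\hat{B}$ into $B$ concrete, here is a minimal NumPy sketch. It assumes $\hat{B}$ is stored flattened with one column per head and that tokens inside a window are ordered (time, height, width); the function names, the small sizes, and the random values are illustrative, not taken from the paper's code.

```python
import numpy as np

def relative_position_index_3d(P, M):
    """Index into a flattened B_hat for every token pair of a P x M x M window.

    Returns an int array of shape (P*M*M, P*M*M) with values in
    [0, (2P-1)*(2M-1)*(2M-1) - 1].
    """
    grid = np.meshgrid(np.arange(P), np.arange(M), np.arange(M), indexing="ij")
    coords = np.stack(grid).reshape(3, -1)                 # (3, N), N = P*M*M
    rel = coords[:, :, None] - coords[:, None, :]          # (3, N, N) signed offsets
    rel = rel + np.array([P - 1, M - 1, M - 1])[:, None, None]  # shift ranges to start at 0
    return rel[0] * (2 * M - 1) * (2 * M - 1) + rel[1] * (2 * M - 1) + rel[2]

def window_attention(Q, K, V, B):
    """SoftMax(Q K^T / sqrt(d) + B) V for inputs shaped (heads, N, d)."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d) + B
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

P, M, heads, d = 2, 3, 4, 8          # hypothetical small sizes
N = P * M * M
rng = np.random.default_rng(0)
index = relative_position_index_3d(P, M)                   # (N, N)
B_hat = rng.standard_normal(((2 * P - 1) * (2 * M - 1) * (2 * M - 1), heads))
B = B_hat[index].transpose(2, 0, 1)                        # (heads, N, N)

Q, K, V = (rng.standard_normal((heads, N, d)) for _ in range(3))
out = window_attention(Q, K, V, B)                         # (heads, N, d)
```

Every pair of tokens with the same relative offset shares one entry of $\hat{B}$, which is why the table needs only $(2P-1)(2M-1)(2M-1)$ values per head rather than $(PM^2)^2$.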

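In this model, the input token is inflated to a temporal dimension of 2, so the shape of the linear embedding layer changes from $48 \times C$ in the original Swin to $96 \times C$. For initialization, the weights of the pre-trained model are simply duplicated twice and the whole matrix is multiplied by 0.5, which keeps the mean and variance of the output unchanged. The relative position bias matrix has shape $(2P-1, 2M-1, 2M-1)$, compared with $(2M-1, 2M-1)$ in the original Swin. To make the relative position bias the same within each frame, the matrix from the pre-trained model is duplicated $2P-1$ times, giving a matrix of shape $(2P-1, 2M-1, 2M-1)$ for initialization.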
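The two initialization rules can be sketched as follows. This is a hedged NumPy illustration: the function names are hypothetical, the "pre-trained" arrays are random stand-ins, and the sketch assumes each 96-dim token is flattened frame by frame, so its first 48 values come from the first frame of the patch and the last 48 from the second.

```python
import numpy as np

def inflate_patch_embedding(w2d, temporal=2):
    """(48, C) -> (96, C): duplicate along the temporal patch axis, scale by 1/temporal."""
    return np.concatenate([w2d] * temporal, axis=0) / temporal

def inflate_relative_position_bias(b2d, P):
    """(2M-1, 2M-1, heads) -> (2P-1, 2M-1, 2M-1, heads) by duplicating 2P-1 times."""
    return np.repeat(b2d[None], 2 * P - 1, axis=0)

rng = np.random.default_rng(0)
C, P, M, heads = 96, 8, 7, 3                      # hypothetical sizes
w2d = rng.standard_normal((48, C))                # stand-in for pre-trained 2D weights
b2d = rng.standard_normal((2 * M - 1, 2 * M - 1, heads))

w3d = inflate_patch_embedding(w2d)                # (96, C)
b3d = inflate_relative_position_bias(b2d, P)      # (15, 13, 13, 3)

# If both frames of a 3D patch are identical, the inflated embedding
# reproduces the 2D embedding of that frame.
frame_patch = rng.standard_normal(48)
token3d = np.concatenate([frame_patch, frame_patch])
same = np.allclose(token3d @ w3d, frame_patch @ w2d)
```

The factor 0.5 is what makes the check at the end hold: two copies of the same weights would otherwise double the output for a static clip.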

Model Architecture $Fig.1^{[1]}$ Overall architecture of Video Swin Transformer

3D shifted window $Fig.2^{[1]}$ An illustrated example of 3D shifted windows. The input size $T' \times H' \times W'$ is $8 \times 8 \times 8$, and the 3D window size $P \times M \times M$ is $4 \times 4 \times 4$. As layer $l$ adopts regular window partitioning, the number of windows in layer $l$ is $2 \times 2 \times 2=8$. For layer $l+1$, as the windows are shifted by $(\frac{P}{2},\frac{M}{2}, \frac{M}{2}) = (2,2,2)$ tokens, the number of windows becomes $3 \times 3 \times 3=27$. Though the number of windows is increased, the efficient batch computation for the shifted configuration can be followed, such that the final number of windows for computation is still 8.
