ViViT

ViViT: A Video Vision Transformer[1]

The authors are Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid from Google Research. Citation [1]: Arnab, Anurag et al. "ViViT: A Video Vision Transformer." 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 6816-6826.

Time

  • 2021.Jun

Key Words

  • spatio-temporal tokens
  • transformer
  • regularising the model; factorising the model along spatial and temporal dimensions to increase efficiency and scalability

Motivation

  1. Transformer-based models are very effective when large training sets are available. The authors show how to regularise the model during training and leverage pretrained image models so that it can be trained on comparatively small datasets.

Summary

  1. Methods for mapping a video \(V\) to a sequence of tokens \(Z\):

    • Uniform frame sampling. Uniformly sample \(n_t\) frames from the video clip and embed each 2D frame independently, in the same way as in ViT.

    Uniform frame sampling \(Fig.1^{[1]}\): We simply sample \(n_t\) frames, and embed each 2D frame independently following ViT.

    • Tubelet embedding. Extract non-overlapping spatio-temporal "tubes" from the input volume and linearly project them to \(R^d\); this extends ViT's embedding to 3D. For a tubelet of dimension \(t \times h \times w\), \(n_t = \lfloor \frac{T}{t} \rfloor\), \(n_h = \lfloor \frac{H}{h} \rfloor\), and \(n_w = \lfloor \frac{W}{w} \rfloor\) tokens are extracted from the temporal, height, and width dimensions respectively. Smaller tubelet dimensions yield more tokens and thus more computation. This method fuses spatio-temporal information during tokenisation, unlike uniform frame sampling, where temporal information from different frames is fused by the transformer. A minimal code sketch of this tokenisation follows after Fig.2 below.

    Tubelet embedding \(Fig.2^{[1]}\): We extract and linearly embed non-overlapping tubelets that span the spatio-temporal input volume.
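
    A minimal sketch of tubelet embedding, assuming a PyTorch-style (B, C, T, H, W) video tensor; the class name, default shapes, and hyperparameters are illustrative, not the authors' implementation:

    ```python
    import torch
    import torch.nn as nn

    class TubeletEmbedding(nn.Module):
        """Tokenise a video into non-overlapping spatio-temporal tubelets
        via a 3D convolution whose stride equals its kernel size."""
        def __init__(self, t=2, h=16, w=16, in_channels=3, d=768):
            super().__init__()
            self.proj = nn.Conv3d(in_channels, d, kernel_size=(t, h, w), stride=(t, h, w))

        def forward(self, video):                    # video: (B, C, T, H, W)
            x = self.proj(video)                     # (B, d, n_t, n_h, n_w)
            return x.flatten(2).transpose(1, 2)      # (B, n_t*n_h*n_w, d)

    # A 32-frame 224x224 clip gives (32/2)*(224/16)*(224/16) = 3136 tokens.
    tokens = TubeletEmbedding()(torch.randn(1, 3, 32, 224, 224))
    print(tokens.shape)                              # torch.Size([1, 3136, 768])
    ```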

  2. Transformer Models for Video. In plain terms: one option is the straightforward approach of collecting all tokens and feeding them into a ViT encoder; the others factorise the encoder, or the self-attention module inside it, into a spatial component and a temporal component.

    • Model 1: Spatio-temporal attention. All tokens extracted from the video are fed into the transformer encoder; each layer models pairwise interactions between all tokens, so the complexity grows quadratically with the number of tokens.

    • Model 2: Factorised encoder. The model consists of two separate encoders: a spatial encoder that only models interactions between tokens extracted from the same temporal index, producing frame-level representations \(h_i\) that are concatenated into \(H\); and a temporal encoder with \(L_t\) layers that then models interactions between tokens from different temporal indices. The output token of the temporal encoder is finally classified. Compared with Model 1, this model has more layers (and thus more parameters), but it requires fewer floating point operations (FLOPs).

    Factorized encoder \(Fig.3^{[1]}\): Factorised encoder (Model 2). This model consists of two transformer encoders in series: the first models interactions between tokens extracted from the same temporal index to produce a latent representation per time-index. The second transformer models interactions between time steps. It thus corresponds to a "late fusion" of spatial and temporal information.

    • Model 3: Factorised self-attention. Within each transformer block, the multi-headed self-attention operation is factorised into two operations: one computes self-attention spatially (among all tokens extracted from the same temporal index) and one temporally (among all tokens extracted from the same spatial index). This is more efficient than Model 1 and has the same computational complexity as Model 2. To compute spatial self-attention, the tokens \(z\) are reshaped from \(R^{1 \times n_t * n_h * n_w * d}\) to \(R^{n_t \times n_h * n_w * d}\); the input to temporal self-attention, \(z_t\), is reshaped to \(R^{n_h * n_w \times n_t * d}\). Whether the order is spatial-then-temporal or temporal-then-spatial makes little difference. This model does not use a classification token, to avoid ambiguity when reshaping the input tokens between the spatial and temporal dimensions.

    Factorized self-attention \(Fig.4^{[1]}\): Factorised self-attention (Model 3). Within each transformer block, the multi-headed self-attention operation is factorised into two operations (indicated by striped boxes) that first only compute self-attention spatially, and then temporally.

    • Model 4: Factorised dot-product attention. For each head, the self-attention operation is defined as: \[\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}.\]

    In self-attention, the queries \(Q = XW_q\), keys \(K = XW_k\), and values \(V = XW_v\) are linear projections of the input \(X\), with \(X, Q, K, V \in R^{N \times d}\). In Model 1, the spatial and temporal dimensions are merged, so \(N = n_t * n_h * n_w\).

    The main idea is to modify the keys and values for each query so that it only attends over tokens from the same spatial or temporal index, by constructing \(K_s, V_s \in R^{n_h * n_w \times d}\) and \(K_t, V_t \in R^{n_t \times d}\). Half of the attention heads attend over tokens from the spatial dimension by computing \(Y_s = \text{Attention}(Q, K_s, V_s)\), and the other half attend over the temporal dimension by computing \(Y_t = \text{Attention}(Q, K_t, V_t)\). The outputs of the multiple heads are then combined by concatenating them and applying a linear projection, \(Y = \text{Concat}(Y_s, Y_t)W_o\). Since only the attention neighbourhood is changed for each query, the attention operation has the same dimension as in the unfactorised case (Model 1), namely \(Y_s, Y_t \in R^{N \times d}\). A code sketch of this factorised attention is given after Fig.5 below.

    Factorized dot-product attention \(Fig.5^{[1]}\): Factorised dot-product attention (Model 4). For half of the heads, we compute dot-product attention over only the spatial axes, and for the other half, over only the temporal axis.
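
    A minimal sketch of the factorised dot-product attention idea (Model 4), assuming tokens are laid out as \(n_t\) temporal indices times \(n_s = n_h * n_w\) spatial positions; the module name, head split, and shapes are illustrative, not the authors' code:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def attention(q, k, v):
        """Scaled dot-product attention over the last two dimensions."""
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v

    class FactorisedDotProductAttention(nn.Module):
        """Half of the heads attend spatially (within one temporal index),
        the other half temporally (within one spatial index)."""
        def __init__(self, d=768, heads=12):
            super().__init__()
            assert heads % 2 == 0
            self.heads, self.dh = heads, d // heads
            self.qkv = nn.Linear(d, 3 * d)
            self.proj = nn.Linear(d, d)

        def forward(self, z, n_t, n_s):                  # z: (B, n_t*n_s, d)
            B, N, d = z.shape
            q, k, v = self.qkv(z).chunk(3, dim=-1)
            def split(x):                                # -> (B, heads, n_t, n_s, dh)
                return x.view(B, n_t, n_s, self.heads, self.dh).permute(0, 3, 1, 2, 4)
            q, k, v = split(q), split(k), split(v)
            h = self.heads // 2
            # Spatial heads: each query attends over the n_s tokens of its frame.
            y_s = attention(q[:, :h], k[:, :h], v[:, :h])
            # Temporal heads: each query attends over the n_t tokens at its location.
            y_t = attention(q[:, h:].transpose(2, 3), k[:, h:].transpose(2, 3),
                            v[:, h:].transpose(2, 3)).transpose(2, 3)
            y = torch.cat([y_s, y_t], dim=1)             # concat heads: Y = [Y_s, Y_t]
            y = y.permute(0, 2, 3, 1, 4).reshape(B, N, d)
            return self.proj(y)                          # linear projection W_o
    ```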

  3. Since video datasets have far fewer labelled examples than image datasets, training large models from scratch to high accuracy is very challenging. To sidestep this problem and make training more efficient, video models are initialised from pretrained image models. The question is how to initialise parameters that are not present in, or incompatible with, the image model; this is also the question I wanted to ask, because the difference in model structure easily leads to parameter mismatches during initialisation. Some effective strategies are as follows:

    • Positional embedding. A positional embedding \(P\) is added to the input tokens, but the video model has \(n_t\) times as many tokens as the pretrained image model. The positional embeddings are therefore initialised by repeating them temporally from \(R^{n_w * n_h \times d}\) to \(R^{n_t * n_h * n_w \times d}\), so that all tokens with the same spatial index share the same embedding.

    • Embedding weights, E. When tubelet embedding is used, the embedding filter \(E\) is a 3D tensor, unlike the 2D tensor \(E_{image}\) in the pretrained model. A common approach for converting a 2D filter into a 3D convolutional filter for video classification is to "inflate" it by replicating the filter along the temporal dimension and averaging: \[\mathbf{E}=\frac{1}{t}[\mathbf{E}_{\mathrm{image}},\ldots,\mathbf{E}_{\mathrm{image}},\ldots,\mathbf{E}_{\mathrm{image}}].\]

    Another strategy, called "central frame initialisation", initialises \(E\) with zeros at all temporal positions except the centre \(\lfloor \frac{t}{2} \rfloor\): \[\mathbf{E}=[\mathbf{0},\ldots,\mathbf{E}_{\mathrm{image}},\ldots,\mathbf{0}].\]

    With this, the 3D convolutional filter effectively behaves like uniform frame sampling at initialisation, while also enabling the model to learn to aggregate temporal information from multiple frames as training progresses. A code sketch of both initialisation strategies is given at the end of this item.

    This is in fact similar to padding: there is zero padding and there is replicate padding.

    • Transformer weights for Model 3. This model contains two multi-headed self-attention (MSA) modules; the spatial MSA module is initialised from the pretrained module, and all weights of the temporal MSA module are initialised with zeros.
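
    A minimal sketch of the two embedding-weight initialisation strategies, assuming \(E_{image}\) has the shape (d, C, h, w) of a standard ViT patch-embedding filter; the function name and signature are illustrative:

    ```python
    import torch

    def init_tubelet_filter(E_image, t, mode="central"):
        """Build a 3D tubelet filter E of shape (d, C, t, h, w) from a
        pretrained 2D patch-embedding filter E_image of shape (d, C, h, w)."""
        d, c, h, w = E_image.shape
        E = torch.zeros(d, c, t, h, w)
        if mode == "central":
            # Central frame initialisation: zeros everywhere except the centre frame.
            E[:, :, t // 2] = E_image
        else:
            # Filter inflation: replicate along the temporal axis and average.
            E[:] = E_image.unsqueeze(2) / t
        return E
    ```
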
  4. Experimental setup: the backbone follows ViT and BERT. At inference, the network input is a video clip of 32 frames sampled with a stride of 2; multiple views of a longer video are processed and the per-view logits are averaged to obtain the final result. Unless otherwise stated, 4 views per video are used. This notion of "views" comes up repeatedly: the SlowFast paper describes a view as a temporal clip with a spatial crop, e.g. 10 temporal clips each with 3 spatial crops gives 30 views. In this paper, Table 6 also notes that Views \(x \times y\) denotes \(x\) temporal crops and \(y\) spatial crops. A small sketch of multi-view inference follows below.
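
    A minimal sketch of multi-view inference, assuming the video is a (C, T, H, W) tensor, the model takes (B, C, T, H, W) clips, and spatial cropping is omitted for brevity; all names are illustrative:

    ```python
    import torch

    def multi_view_logits(model, video, num_frames=32, stride=2, n_temporal=4):
        """Sample n_temporal clips of num_frames frames with the given stride,
        evenly spaced over the video, and average the per-view logits."""
        C, T, H, W = video.shape
        span = num_frames * stride
        starts = torch.linspace(0, max(T - span, 0), n_temporal).long()
        logits = []
        for s in starts:
            idx = (s + stride * torch.arange(num_frames)).clamp(max=T - 1)
            clip = video[:, idx]                      # (C, num_frames, H, W)
            logits.append(model(clip.unsqueeze(0)))   # add batch dim
        return torch.stack(logits).mean(dim=0)        # averaged per-view logits
    ```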

  5. Experiments: the comparisons are mainly done with Model 2, which performs reasonably well on several datasets. Tubelet embedding with central frame initialisation performs better than filter inflation and uniform frame sampling. Because the pure-transformer architecture needs large amounts of data and easily overfits on small datasets, regularisation is required (data augmentation or reducing model complexity); the experiments use stochastic depth, RandAugment, label smoothing, mixup, and similar methods. A minimal sketch of the mixup and label-smoothing regularisers follows below.
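
    A minimal sketch of the mixup and label-smoothing regularisers mentioned above, written as a generic training-loss helper rather than the paper's exact recipe; names and defaults are illustrative:

    ```python
    import torch
    import torch.nn.functional as F

    def mixup_label_smoothing_loss(model, x, y, num_classes, alpha=0.2, smoothing=0.1):
        """Cross-entropy on mixup-mixed inputs with smoothed targets.
        x: (B, ...) input clips, y: (B,) integer class labels."""
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        perm = torch.randperm(x.size(0))
        x_mix = lam * x + (1 - lam) * x[perm]                        # mixup on inputs
        y_soft = F.one_hot(y, num_classes).float()
        y_soft = y_soft * (1 - smoothing) + smoothing / num_classes  # label smoothing
        y_mix = lam * y_soft + (1 - lam) * y_soft[perm]              # mixup on targets
        logits = model(x_mix)
        return (-y_mix * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    ```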

Architecture \(Fig.6^{[1]}\): We propose a pure-transformer architecture for video classification, inspired by the recent success of such models for images. To effectively process a large number of spatio-temporal tokens, we develop several model variants which factorise different components of the transformer encoder over the spatial- and temporal-dimensions. As shown on the right, these factorisations correspond to different attention patterns over space and time.