MViT

Multiscale Vision Transformer

The authors of MViT are Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer from FAIR and UC Berkeley. MViTv2 is by the same authors, joined by Chao-Yuan Wu.

Citations:
[1]: Fan, Haoqi et al. “Multiscale Vision Transformers.” 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 6804-6815.
[2]: Li, Yanghao et al. “MViTv2: Improved Multiscale Vision Transformers for Classification and Detection.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 4794-4804.

Time

  • MViT: 2021.Apr
  • MViTv2: 2021.Dec

MViT

Motivation

  1. Posit that the fundamental vision principle of resolution and channel scaling can be beneficial for transformer models across a variety of visual recognition tasks.

Key Words

  • Connect the seminal idea of multiscale feature hierarchies with the transformer model.
  • Progressively expand the channel capacity while pooling the resolution from input to output of the network.

Summary

  1. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This builds a multiscale feature pyramid, with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers operating on spatially coarse but complex, high-dimensional features.

  2. Based on studies of the visual cortex in cats and monkeys, Hubel and Wiesel developed a hierarchical model of the visual pathway, with neurons in lower areas such as V1 responding to features such as oriented edges and bars, and neurons in higher areas responding to more specific stimuli. Fukushima proposed the Neocognitron, a model with alternating layers of simple cells and complex cells that incorporates downsampling. LeCun added a further step: training the network weights with backpropagation. The main aspects of the visual processing hierarchy were thus established:

    • Spatial resolution decreases as one ascends the processing hierarchy.
    • The number of different channels increases, with each channel corresponding to ever more specialized features.
  3. Computer vision also developed multiscale processing, known as pyramids, with two motivations:

    • To reduce computational requirements by working at lower resolutions.
    • To obtain a better sense of context at the lower resolutions, which can then guide processing at higher resolutions.
  4. MViT connects multiscale feature hierarchies with the transformer model. The conventional transformer keeps the channel capacity and resolution constant throughout the network; MViT instead has several channel-resolution scale stages that hierarchically expand the channel capacity while reducing the spatial resolution. MViT's fundamental advantage arises from the extremely dense nature of visual signals. A noteworthy benefit is the strong implicit temporal bias of multiscale models on video: when trained on natural video and tested on shuffled frames, vision transformer models show no performance decay, indicating that they are not effectively using temporal information. In contrast, MViT suffers a significant performance drop when tested on shuffled frames, showing that it does exploit temporal information. MViT achieves strong performance without any external pre-training data.

  5. Multi Head Pooling Attention (MHPA)

    • MHPA enables flexible resolution modeling in a transformer block. It pools the sequence of latent tensors to reduce the sequence length (resolution) of the attended input; a minimal code sketch of this pooling attention follows this list.
    • Before attending to the input, the intermediate tensors \(\hat{Q}, \hat{K}, \hat{V}\) are pooled with a pooling operator \(\mathcal{P}\).

    \(Fig.1^{[1]}\). Pooling Attention is a flexible attention mechanism that allows obtaining the reduced space-time resolution (\(\hat{T}\hat{H}\hat{W}\)) of the input (\(THW\)) by pooling the query, \(Q=\mathcal{P}(\hat{Q};\boldsymbol{\Theta}_{Q})\), and/or computes attention on a reduced length (\(\tilde{T}\tilde{H}\tilde{W}\)) by pooling the key, \(K=\mathcal{P}(\hat{K};\boldsymbol{\Theta}_{K})\), and value, \(V=\mathcal{P}(\hat{V};\boldsymbol{\Theta}_{V})\), sequences.

    • The pooling operator \(\mathcal{P}(\cdot;\boldsymbol{\Theta})\), with kernel \(\mathbf{k}\), stride \(\mathbf{s}\), and padding \(\mathbf{p}\), reduces an input tensor of length \(\mathbf{L} = T \times H \times W\) to length \(\tilde{\mathbf{L}}\), with the equation below applying coordinate-wise: \[\tilde{\mathbf{L}}=\left\lfloor\frac{\mathbf{L}+2\mathbf{p}-\mathbf{k}}{\mathbf{s}}\right\rfloor+1\]

    • Query Pooling: the goal is to reduce the resolution at the beginning of a stage and then keep that resolution throughout the stage. Only the first pooling attention operator of each stage operates with a non-degenerate query stride \(s^Q > 1\); all other operators are restricted to \(\mathbf{s}^{Q}\equiv(1,1,1)\).

    • Key-Value Pooling: unlike query pooling, changing the sequence length of the K and V tensors does not change the output sequence length. The usage of K, V pooling is decoupled from Q pooling: Q pooling is used only in the first layer of each stage, while K, V pooling is employed in all other layers. Since the sequence lengths of the key and value tensors must match for the attention weight calculation, the pooling strides applied to K and V must also be identical. In the default setting, all pooling parameters (k, p, s) are constrained to be identical, but s is varied adaptively with respect to the scale across stages.

    • Skip connections: since the channel dimension and sequence length change inside a residual block, the skip connection is pooled to adapt to the dimension mismatch between its two ends. MHPA handles this mismatch by adding the pooling operator \(\mathcal{P}(\cdot;\boldsymbol{\Theta}_{Q})\) to the residual path. To handle the channel-dimension mismatch between stages, an extra linear layer operates on the layer-normalized output of MHPA; this differs from the other skip connections, which operate on the un-normalized signal.

    • Network details: all pooling operations, and hence all resolution downsampling, are performed only on the data sequence and do not involve the class token embedding. At each stage transition, the MLP dimension of the previous stage's output is increased by a factor of 2, and MHPA pools the \(Q\) tensor with \(s^Q = (1,2,2)\) at the input of the next stage; a worked example of this stage plan is given at the end of this section.
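
Below is a minimal, single-head sketch of the pooling attention described above, written in PyTorch. It is an illustration under assumptions rather than the authors' implementation: the names `PoolingAttention` and `pooled_length`, the choice of a strided `nn.Conv3d` as the pooling operator \(\mathcal{P}\), and the default strides are assumptions made for clarity, and the class token is omitted so that pooling only touches the data sequence.

```python
import torch
import torch.nn as nn


def pooled_length(L, k, s, p):
    """Sequence length after pooling one axis: floor((L + 2p - k) / s) + 1."""
    return (L + 2 * p - k) // s + 1


class PoolingAttention(nn.Module):
    """Single-head pooling attention over a (T, H, W) token grid (class token omitted)."""

    def __init__(self, dim, thw, q_stride=(1, 2, 2), kv_stride=(1, 2, 2),
                 kernel=(3, 3, 3), padding=(1, 1, 1)):
        super().__init__()
        self.thw = thw                       # input space-time resolution (T, H, W)
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Pooling operators P(.; Theta), realized here as strided 3D convolutions;
        # max or average pooling would be equally valid choices of P.
        self.pool_q = nn.Conv3d(dim, dim, kernel, stride=q_stride, padding=padding)
        self.pool_k = nn.Conv3d(dim, dim, kernel, stride=kv_stride, padding=padding)
        self.pool_v = nn.Conv3d(dim, dim, kernel, stride=kv_stride, padding=padding)

    def _pool(self, x, op):
        # (B, L, C) with L = T*H*W  ->  (B, C, T, H, W), pool, flatten back to (B, L~, C).
        B, L, C = x.shape
        T, H, W = self.thw
        x = x.transpose(1, 2).reshape(B, C, T, H, W)
        x = op(x)
        out_thw = tuple(x.shape[2:])
        return x.flatten(2).transpose(1, 2), out_thw

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Q pooling sets the output resolution of the block (query stride s^Q);
        # K/V pooling only shortens the sequences that attention is computed over.
        q, out_thw = self._pool(q, self.pool_q)
        k, _ = self._pool(k, self.pool_k)
        v, _ = self._pool(v, self.pool_v)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v
        # Skip connection pooled with P(.; Theta_Q) so both ends of the residual
        # share the reduced sequence length.
        res, _ = self._pool(x, self.pool_q)
        return self.proj(out) + res, out_thw


# Example: an 8x14x14 token grid with 96 channels; s^Q = (1, 2, 2) halves H and W.
x = torch.randn(2, 8 * 14 * 14, 96)
block = PoolingAttention(dim=96, thw=(8, 14, 14))
y, thw = block(x)                            # y: (2, 8*7*7, 96), thw: (8, 7, 7)
print(pooled_length(14, k=3, s=2, p=1))      # 7 = floor((14 + 2 - 3) / 2) + 1
```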

\(Fig.2^{[1]}\). Multiscale Vision Transformers learn a hierarchy from dense (in space) and simple (in channels) to coarse and complex features. Several resolution-channel scale stages progressively increase the channel capacity of the intermediate latent sequence while reducing its length and thereby the spatial resolution.
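
To make the resolution-channel trade-off concrete, the sketch below traces how the channel capacity doubles while H and W are halved at every stage transition. The input size, patchify stride, base width of 96, and four scale stages loosely follow the MViT-B layout reported in [1], but the helper name `mvit_stage_plan` and the exact numbers are illustrative assumptions.

```python
# Assumed 8x224x224 input (T x H x W) after frame sampling, a (1, 4, 4) patchify
# stride, and a base width of 96; the numbers are illustrative only.
def mvit_stage_plan(thw=(8, 224, 224), patch_stride=(1, 4, 4), base_dim=96, n_stages=4):
    t, h, w = (d // s for d, s in zip(thw, patch_stride))  # tokens after the patchify stem
    dim = base_dim
    plan = []
    for stage in range(1, n_stages + 1):
        plan.append((stage, dim, (t, h, w), t * h * w))
        # Stage transition: the channel/MLP dimension doubles, and the first block
        # of the next stage pools Q with stride s^Q = (1, 2, 2), halving H and W.
        dim *= 2
        h, w = h // 2, w // 2
    return plan


for stage, dim, thw, length in mvit_stage_plan():
    print(f"scale{stage}: dim={dim:4d}  THW={thw}  sequence length={length}")
# scale1: dim=  96  THW=(8, 56, 56)  sequence length=25088
# scale2: dim= 192  THW=(8, 28, 28)  sequence length=6272
# scale3: dim= 384  THW=(8, 14, 14)  sequence length=1568
# scale4: dim= 768  THW=(8, 7, 7)  sequence length=392
```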