HieraViT

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles[1]

The authors are Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer, from Meta, Georgia Tech, and Johns Hopkins. Citation [1]: Ryali, Chaitanya K. et al. "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles." ArXiv abs/2306.00989 (2023): n. pag.

Time

  • 2023.Jun

Key Words

  • visual pretext task: MAE
  • hierarchical(multiscale) vision transformer
  • Mask unit attention vs Window attention
  • add spatial bias by teaching it to the model using a strong pretext task like MAE, instead of vision-specific modules like shifted windows or convs.
  • One-sentence summary: pretrain MViTv2 (as the encoder) with MAE instead of a vanilla ViT, strip out several of MViTv2's design components, and replace them with mask unit attention; this achieves very good results.

Motivation

  1. Many current hierarchical ViTs add vision-specific components in pursuit of supervised classification accuracy. While these components yield good accuracy and attractive FLOP counts, the added complexity makes these transformers slower overall than their vanilla ViT counterparts. The authors argue that this extra bulk is unnecessary: by pretraining with a strong visual pretext task (MAE), many of the bells-and-whistles can be removed without losing accuracy. This is why the authors propose Hiera.

Summary

  1. ViT's simplicity comes at a cost: the entire network runs at the same spatial resolution and channel count, so ViTs do not use their parameters efficiently. This differs from earlier "hierarchical" or "multi-scale" designs, which use fewer channels and higher spatial resolution in the early stages, and more channels with lower spatial resolution (and more complex features) later. Some models do adopt a hierarchical design, but the added complexity makes them slower overall.

  2. The authors argue that this bulk is unnecessary. After the patchify operation, ViTs lack inductive bias, and many ViT variants add spatial biases by hand. If the model can be trained to learn these biases, why slow it down by building them in? MAE pretraining is an effective tool for teaching ViTs spatial reasoning; it is sparse and much faster than regular supervised training, which makes it an attractive choice in many domains. Starting from an existing hierarchical ViT and progressively removing non-essential components while training with MAE, the authors find that no convolutions, no shifted or cross-shaped windows, and no decomposed relative position embeddings are needed: a pure, simple hierarchical ViT is both fast and highly accurate, and combining MAE with a ViT enables efficient training by exploiting the sparsity of masked images.

  3. Approach: previous work improved accuracy by designing complex architectures and adding modules to inject spatial bias. The authors take a different strategy: keep the model simple and learn the biases through a strong pretext task, progressively removing the bells-and-whistles while training on that pretext task.

  4. A few points that took some untangling: masked tokens are deleted instead of being overwritten as in other masked image modeling. Why delete them if the goal is to reconstruct them? In MAE pretraining only the visible tokens are fed to the (sparse) encoder; the decoder then reconstructs the image from the encoded visible patches plus learned mask tokens. This poses a problem for existing hierarchical models: deleting tokens breaks the 2D grid they rely on. Moreover, MAE masks out individual tokens, which are large \(16 \times 16\) patches for ViT but small \(4 \times 4\) patches for most hierarchical models. The paper therefore distinguishes tokens from "mask units": a mask unit is the resolution at which MAE masking is applied, while a token is the model's internal resolution (a sketch of this sparse masking appears after this list).

  5. The authors use MViTv2 as the base architecture. MViTv2:

    • It is a hierarchical model with four stages that learn multi-scale representations.
    • It starts with a small channel count and high spatial resolution.
    • A distinctive feature of MViTv2 is pooling attention: features are locally aggregated with a \(3 \times 3\) conv before computing self-attention. In pooling attention, K and V are pooled in the first two stages to reduce computation, and Q is pooled to downsample between stages. MViTv2 also uses decomposed relative position embeddings instead of absolute ones (a simplified pooling-attention sketch appears after this list).
  6. When applying MAE: because MViTv2 is downsampled by \(2\times\) three times and uses a token size of \(4 \times 4\) pixels, the authors use a mask unit of size \(32 \times 32\) pixels. This ensures that each mask unit corresponds to \(8^2, 4^2, 2^2, 1^2\) tokens in stages 1, 2, 3, 4 respectively, so every mask unit covers at least one distinct token in each stage and convolution kernels do not bleed into deleted tokens. (The mask unit vs. token distinction is the one from point 4; the token arithmetic is worked out in the masking sketch after this list.)

  7. Simplifying MViTv2:

    • MViTv2 adds relative position embeddings to the attention in every block; the authors replace them with absolute position embeddings.

    • To remove the convolutions from the model, the authors replace them with maxpools. Only the Q pooling at each stage transition and the KV pooling in the first two stages are kept.

    • Removing overlap: the remaining maxpool layers have a kernel size of \(3 \times 3\), which makes a separate-and-pad trick necessary during sparse pretraining (a \(3 \times 3\) kernel would otherwise bleed across mask-unit boundaries). Setting kernel size equal to stride for each maxpool avoids this problem.

    • Mask unit attention: pooling Q is necessary to maintain a hierarchical model, but KV pooling exists only to reduce the size of the attention matrix. Simply removing it would increase the network's compute, so it is replaced with local attention within a mask unit. Because sparse MAE pretraining separates the mask units at the very start of the network, mask unit attention differs from window attention: its window size adapts to the size of a mask unit at the current resolution, whereas window attention uses a fixed window size throughout the network, which would leak into deleted tokens after a downsample.

    • To elaborate on mask unit attention from the code: after patchifying, the encoder input is [B, N, C]. Given a window size, the number of windows follows from window_size * num_windows = N, and the q, k, v tensors then have shape [B, num_heads, num_windows, window_size, dim_per_head]. The qkv projection width relates to the per-head dim as qkv_dim = dim_per_head * num_heads * 3 (see the mask unit attention sketch after this list).

  8. Multi-Scale Decoder: Hiera exploits multi-scale information by fusing the representations from all stages before the decoder, whereas vanilla MAE feeds only the tokens from the encoder's last block to the decoder (a hedged fusion sketch follows below).
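To make points 4 and 6 concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code; sizes assume a 224x224 input, \(4 \times 4\)-pixel tokens, and \(32 \times 32\)-pixel mask units as in the paper) of how one mask unit maps to \(8^2, 4^2, 2^2, 1^2\) tokens across the four stages, and how sparse masking deletes whole mask units rather than overwriting them:

```python
# Illustrative sketch, not the authors' code.
import torch

img_size, token_px, mask_unit_px = 224, 4, 32
mu_per_side = img_size // mask_unit_px            # 7 mask units per image side
mu_token_side = mask_unit_px // token_px          # 8x8 tokens per mask unit at stage 1

# After each 2x downsample, one mask unit covers fewer tokens:
for stage in range(1, 5):
    side = mu_token_side // 2 ** (stage - 1)
    print(f"stage {stage}: one mask unit = {side}x{side} = {side ** 2} tokens")
# -> 8^2, 4^2, 2^2, 1^2, matching point 6.

# Sparse masking: keep ~40% of mask units and delete the rest outright
# (they are not zeroed or overwritten, they simply never enter the encoder).
B, C = 2, 96
x = torch.randn(B, mu_per_side ** 2, mu_token_side ** 2, C)   # [B, mask_units, tokens_per_unit, C]
num_keep = int(0.4 * mu_per_side ** 2)
keep = torch.rand(B, mu_per_side ** 2).argsort(dim=1)[:, :num_keep]
x_visible = torch.gather(x, 1, keep[..., None, None].expand(-1, -1, x.shape[2], C))
print(x_visible.shape)                                         # [2, 19, 64, 96]
```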
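For point 5, a simplified pooling-attention sketch (my own illustration; it omits MViTv2 details such as the residual pooled-Q path and the decomposed relative position embeddings) showing the core idea: K and V are locally aggregated with a strided \(3 \times 3\) conv before self-attention, which shrinks the attention matrix:

```python
# Simplified MViTv2-style pooling attention (illustrative only).
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    def __init__(self, dim=96, num_heads=4, kv_stride=2):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        # 3x3 depthwise convs pool K/V spatially (the convs Hiera later removes).
        self.pool_k = nn.Conv2d(dim, dim, 3, stride=kv_stride, padding=1, groups=dim)
        self.pool_v = nn.Conv2d(dim, dim, 3, stride=kv_stride, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                         # x: [B, H, W, C]
        B, H, W, C = x.shape
        q = self.q(x).reshape(B, H * W, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv(x).chunk(2, dim=-1)
        k = self.pool_k(k.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)   # [B, HW/4, C]
        v = self.pool_v(v.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        k = k.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5             # [B, h, HW, HW/4]
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(B, H, W, C)
        return self.proj(out)

pa = PoolingAttention(dim=96, num_heads=4)
print(pa(torch.randn(2, 56, 56, 96)).shape)       # torch.Size([2, 56, 56, 96])
```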
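For point 7, a minimal mask unit attention sketch assuming the shapes described in the note above (my own reconstruction, not the official Hiera code; the class name and window_size argument are illustrative). The key property is that attention is computed only within each window of window_size tokens, i.e. within one mask unit at the current resolution:

```python
# Minimal mask unit (local-window) attention sketch.
import torch
import torch.nn as nn

class MaskUnitAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)        # qkv_dim = dim_per_head * num_heads * 3
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, window_size: int) -> torch.Tensor:
        B, N, C = x.shape
        num_windows = N // window_size            # window_size * num_windows == N
        # [B, N, 3C] -> [3, B, num_heads, num_windows, window_size, head_dim]
        qkv = (
            self.qkv(x)
            .reshape(B, num_windows, window_size, 3, self.num_heads, self.head_dim)
            .permute(3, 0, 4, 1, 2, 5)
        )
        q, k, v = qkv.unbind(0)
        # Attention only inside each window (mask unit), so deleted tokens
        # from other units can never leak in.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v            # [B, heads, num_windows, window_size, head_dim]
        out = out.permute(0, 2, 3, 1, 4).reshape(B, N, C)
        return self.proj(out)

mua = MaskUnitAttention(dim=96, num_heads=4)
x = torch.randn(2, 20 * 64, 96)                   # 20 visible mask units, 64 tokens each
print(mua(x, window_size=64).shape)               # torch.Size([2, 1280, 96])
```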
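For point 8, one plausible way to fuse multi-scale features for the decoder, sketched under my own assumptions (dense [B, H, W, C] stage outputs, maxpool to the last stage's resolution plus a per-stage linear projection, then summation); the paper's exact head design may differ, and during sparse pretraining the features are grouped by mask unit rather than laid out on a dense grid:

```python
# Hedged sketch of multi-scale fusion before the MAE decoder (not the paper's exact heads).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    def __init__(self, stage_dims=(96, 192, 384, 768), out_dim=512):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, out_dim) for d in stage_dims)

    def forward(self, stage_feats):
        # stage_feats: list of [B, H_i, W_i, C_i], one per stage, spatially coarser each time.
        B, Ht, Wt, _ = stage_feats[-1].shape                  # target grid = last stage
        fused = 0
        for feat, proj in zip(stage_feats, self.projs):
            x = F.adaptive_max_pool2d(feat.permute(0, 3, 1, 2), (Ht, Wt))  # pool to [B, C, Ht, Wt]
            fused = fused + proj(x.permute(0, 2, 3, 1))       # project to a common width, sum
        return fused                                          # [B, Ht, Wt, out_dim], flattened for the decoder

fusion = MultiScaleFusion()
feats = [torch.randn(2, 56, 56, 96), torch.randn(2, 28, 28, 192),
         torch.randn(2, 14, 14, 384), torch.randn(2, 7, 7, 768)]
print(fusion(feats).shape)                                    # torch.Size([2, 7, 7, 512])
```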

\(Fig.1^{[1]}\) Hiera Setup. Modern hierarchical transformers like Swin (Liu et al., 2021) or MViT (Li et al., 2022c) are more parameter efficient than vanilla ViTs (Dosovitskiy et al., 2021), but end up slower due to overhead from adding spatial bias through vision-specific modules like shifted windows or convs. In contrast, we design Hiera to be as simple as possible. To add spatial bias, we opt to teach it to the model using a strong pretext task like MAE (pictured here) instead. Hiera consists entirely of standard ViT blocks. For efficiency, we use local attention within "mask units" for the first two stages and global attention for the rest. At each stage transition, Q and the skip connection have their features doubled by a linear layer and spatial dimension pooled by a 2 × 2 maxpool.

\(Fig.2^{[1]}\) Mask Unit Attn vs. Window Attn. Window attention (a) performs local attention within a fixed-size window. Doing so would potentially overlap with deleted tokens during sparse MAE pretraining. In contrast, mask unit attention (b) performs local attention within individual mask units, no matter their size.