MViT
Multiscale Vision Transformer
The authors of MViT are Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer from FAIR and UC Berkeley. MViTv2 has the same author list, with the addition of Chao-Yuan Wu.
References
[1] Fan, Haoqi et al. "Multiscale Vision Transformers." 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 6804-6815.
[2] Li, Yanghao et al. "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection." 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 4794-4804.
Time
- MViT: 2021.Apr
- MViTv2: 2021.Dec
MViT
Motivation
- The authors posit that the fundamental vision principle of resolution and channel scaling can be beneficial for transformer models across a variety of visual recognition tasks.
Key Words
- connect the seminal idea of multiscale feature hierarchies with the transformer model
- progressively expand the channel capacity while pooling the resolution from the input to the output of the network (see the sketch after this list)
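To make the resolution-pooling idea concrete, here is a minimal PyTorch sketch of a pooling-attention block in the spirit of MViT's Multi Head Pooling Attention. This is a hypothetical simplification, not the paper's reference implementation: the class and parameter names (`PoolingAttention`, `q_stride`, `kv_stride`) are made up for illustration, and details such as the pooling operator choice, relative positional embeddings, and MViTv2's residual pooling connection are omitted. The key point it shows is that strided pooling on Q shrinks the output token grid, while the channel expansion between stages would be handled by a separate projection that is not shown here.

```python
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    """Minimal sketch of MViT-style pooling attention (a hypothetical
    simplification, not the authors' reference implementation).
    Q/K/V token grids are downsampled with strided pooling before
    attention, so the output sequence has a lower spatial resolution."""

    def __init__(self, dim, num_heads=8, q_stride=2, kv_stride=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # strided pooling shrinks the token grid (resolution pooling)
        self.pool_q = nn.MaxPool2d(q_stride, q_stride)
        self.pool_kv = nn.MaxPool2d(kv_stride, kv_stride)

    def forward(self, x, hw):
        # x: (B, N, C) tokens laid out on an hw = (H, W) grid, N = H * W
        B, N, C = x.shape
        H, W = hw
        qkv = self.qkv(x).reshape(B, N, 3, C).permute(2, 0, 1, 3)
        q, k, v = qkv[0], qkv[1], qkv[2]

        def pool(t, op):
            # (B, N, C) -> (B, C, H, W) -> pooled -> (B, h * w, C)
            t = t.transpose(1, 2).reshape(B, C, H, W)
            t = op(t)
            h, w = t.shape[-2:]
            return t.reshape(B, C, h * w).transpose(1, 2), (h, w)

        # pooling Q determines the (smaller) output resolution;
        # pooling K/V just reduces attention cost
        q, out_hw = pool(q, self.pool_q)
        k, _ = pool(k, self.pool_kv)
        v, _ = pool(v, self.pool_kv)

        def heads(t):
            return t.reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = heads(q), heads(k), heads(v)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, -1, C)
        return self.proj(out), out_hw

# Usage: a 14x14 grid of 96-dim tokens is pooled to a 7x7 output grid.
blk = PoolingAttention(dim=96, num_heads=8)
x = torch.randn(2, 14 * 14, 96)
out, out_hw = blk(x, (14, 14))
print(out.shape, out_hw)  # torch.Size([2, 49, 96]) (7, 7)
```

With `q_stride=2` the 196 input tokens become 49 output tokens, which is how the resolution shrinks stage by stage; in the full model, later stages also widen the channel dimension via a linear projection between stages.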