Exploring Plain ViT for Object Detection
Exploring Plain Vision Transformer Backbones for Object Detection[1]
The authors are Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He from FAIR. Citation [1]: Li, Yanghao, et al. "Exploring plain vision transformer backbones for object detection." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
Time
- 2022.Mar
Key Words
- Plain ViT for Object Detection
Summary
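- The authors explore the plain, non-hierarchical ViT as a backbone for object detection. This design lets the original ViT architecture be fine-tuned for object detection without redesigning a hierarchical backbone for pre-training. With only minimal adaptations for fine-tuning, this plain-backbone detector achieves competitive results. The authors observe: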
- A simple feature pyramid built from a single-scale feature map is sufficient; the usual FPN design is not needed.
- Window attention (without shifting), aided by very few cross-window propagation blocks, is sufficient.
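The two observations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The function names are hypothetical, nearest-neighbour upsampling and average pooling stand in for the learned deconvolution and strided layers a real detector would use, and placing one cross-window (global) block at the end of each of four equal groups of blocks is an assumption made here for illustration.

```python
import numpy as np


def simple_feature_pyramid(feat, scales=(4.0, 2.0, 1.0, 0.5)):
    """Build multi-scale maps from a single (C, H, W) feature map.

    Stand-ins (assumptions): nearest-neighbour repeat replaces learned
    deconvolution, average pooling replaces a learned strided layer.
    """
    c, h, w = feat.shape
    outputs = []
    for s in scales:
        if s >= 1:
            k = int(s)
            # Upsample by repeating each pixel k times along H and W.
            out = feat.repeat(k, axis=1).repeat(k, axis=2)
        else:
            k = int(round(1 / s))
            # Downsample by averaging each k x k neighbourhood.
            out = feat.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))
        outputs.append(out)
    return outputs


def window_partition(x, window):
    """Split an (H, W, C) map into non-overlapping (window, window, C) tiles.

    Attention is then computed inside each tile; no shifting is applied.
    """
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, c)


def global_block_indices(depth, groups=4):
    """Hypothetical layout: the last block of each group attends globally."""
    per_group = depth // groups
    return [per_group * (g + 1) - 1 for g in range(groups)]


# A toy single-scale feature map: 8 channels on a 32 x 32 grid.
feat = np.random.rand(8, 32, 32).astype(np.float32)

pyramid = simple_feature_pyramid(feat)
shapes = [p.shape for p in pyramid]

hwc = feat.transpose(1, 2, 0)          # (H, W, C) layout for windowing
windows = window_partition(hwc, window=8)

propagation_blocks = global_block_indices(depth=12)
```

Here `shapes` lists four resolutions derived from one map, `windows` holds 16 tiles of 8 x 8 tokens, and `propagation_blocks` marks the few blocks where information crosses window boundaries; all other blocks would use window attention only.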