BEiT
BEiT: BERT Pre-Training of Image Transformers[1]
The authors are Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei, from Harbin Institute of Technology and Microsoft. Citation: Bao, Hangbo et al. "BEiT: BERT Pre-Training of Image Transformers." arXiv abs/2106.08254 (2021).
Time
- June 2021
Key Words
- Self-supervised vision representation model: BEiT
- Pre-training task: masked image modeling (MIM)
- Two views of image representation: image patches (input) and visual tokens (output)
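To make the two views concrete, here is a minimal PyTorch sketch, assuming a 224x224 RGB image split into 16x16 patches (14x14 = 196 patches) as in the paper; `tokenizer` is a stand-in for the frozen dVAE tokenizer that BEiT borrows from DALL-E.

```python
import torch

def to_patches(image: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Input view: flatten an image into a sequence of patch vectors,
    which a linear projection then embeds for the Transformer."""
    c, _, _ = image.shape
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # (C, H/P, W/P, P, P) -> (num_patches, C*P*P)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)

image = torch.randn(3, 224, 224)
patches = to_patches(image)      # (196, 768): the input view
# Output view (hypothetical call): the frozen tokenizer maps the same
# image to 196 discrete visual tokens from an 8192-entry codebook.
# visual_tokens = tokenizer(image)   # shape (196,), ints in [0, 8192)
print(patches.shape)
```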
Problem Addressed
Directly applying BERT-style pre-training to image data is challenging:
- There is no pre-existing vocabulary for ViT's input units, i.e., image patches, so we cannot simply employ a softmax classifier to predict over all possible candidates for the masked patches (see the sketch after this list).
- A straightforward alternative is to treat the task as regression, predicting the raw pixels of the masked patches; however, such pixel-level recovery tends to waste modeling capability on short-range dependencies and high-frequency details.
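BEiT resolves the vocabulary problem by tokenizing the image with a separately trained discrete VAE (the dVAE from DALL-E), whose 8192-entry codebook plays the role that BERT's word vocabulary plays in language. Below is a minimal PyTorch sketch of the resulting MIM objective under those assumptions; `hidden`, `mask`, and `visual_tokens` are illustrative names, and the roughly 40% blockwise masking ratio follows the paper.

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim, num_patches = 8192, 768, 196

# Softmax classifier over the visual-token vocabulary -- possible only
# because the dVAE tokenizer supplies a discrete target for every patch.
mim_head = nn.Linear(hidden_dim, vocab_size)

hidden = torch.randn(num_patches, hidden_dim)                 # encoder output per patch position
visual_tokens = torch.randint(0, vocab_size, (num_patches,))  # tokenizer targets (stand-in)

mask = torch.zeros(num_patches, dtype=torch.bool)
mask[torch.randperm(num_patches)[:75]] = True                 # ~40% of patches masked, as in BEiT

# The loss is computed only at masked positions: predict each corrupted
# patch's visual token, not its raw pixels, sidestepping the
# pixel-regression shortcomings noted above.
logits = mim_head(hidden[mask])
loss = nn.functional.cross_entropy(logits, visual_tokens[mask])
print(loss.item())
```

This is the same cross-entropy formulation BERT uses for masked language modeling, with visual tokens standing in for subword tokens.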