BEiT

BEiT: BERT Pre-Training of Image Transformers[1]

The paper is by Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei of Harbin Institute of Technology and Microsoft. Citation: Bao, Hangbo et al. "BEiT: BERT Pre-Training of Image Transformers." ArXiv abs/2106.08254 (2021).

Time

  • 2021.Jun

Key Words

  • Self-supervised vision representation model: BEiT
  • Pre-training task: masked image modeling (MIM)
  • Two views of image representation: image patches (input) and visual tokens (output)

Problems Addressed

  1. Directly applying the BERT approach to image data is challenging:

    • There is no pre-existing vocabulary for ViT's input units, i.e., image patches, so we cannot simply use a softmax classifier to predict over all possible candidates for the masked patches.
    • A straightforward alternative is to treat the task as a regression problem and predict the raw pixels of the masked patches; however, such pixel-level recovery tends to waste modeling capability on short-range dependencies and high-frequency details.

Summary

  2. BEiT: Bidirectional Encoder representation from Image Transformers, following BERT. To address the data-hungry issue, self-supervised methods such as contrastive learning and self-distillation have been explored for vision Transformers. The paper proposes a new pre-training task: masked image modeling (MIM).

  3. Split the image into a grid of patches, which form the input representation of the backbone Transformer. In addition, tokenize the image into discrete visual tokens, obtained from the latent codes of a discrete VAE. During pre-training, some image patches are randomly masked and the corrupted input is fed to the Transformer; the model learns to recover the visual tokens of the original image rather than the raw pixels of the masked patches. This differs from MAE_st, which reconstructs raw pixels (pixel-level reconstruction).
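A minimal sketch of the two views, assuming the standard ViT setup (a 224×224 RGB image split into 16×16 patches, giving a 14×14 grid of 196 patches); the `tokenizer` call is a placeholder for the visual tokenizer sketched further below.

```python
import torch

# Assumed setup: 224x224 RGB image, 16x16 patches -> a 14x14 grid of 196 patches.
image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch_size = 16

# View 1: image patches -- each 3x16x16 patch is flattened into a 768-d vector.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)                  # torch.Size([1, 196, 768])

# View 2: visual tokens -- a frozen tokenizer (placeholder, see the dVAE sketch below)
# maps the image to one discrete token id per patch from a vocabulary of 8192 codes:
# visual_tokens = tokenizer(image)    # LongTensor of shape (1, 196)
```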

  4. Perform self-supervised pre-training, which requires no labels, then fine-tune the pre-trained BEiT model on two downstream tasks: image classification and semantic segmentation.

  5. BEiT learns reasonable semantic regions through pre-training, unleashing the rich supervision signals contained in images.

  6. Contributions:

    • Proposed MIM to pre-train vision Transformers in a self-supervised manner, with a theoretical explanation from a variational-autoencoder perspective.
    • Pre-trained BEiT and conducted extensive fine-tuning experiments on downstream tasks.
    • Showed that BEiT's self-attention mechanism learns to distinguish semantic regions and object boundaries without any human annotation.
  7. Visual tokenizer: tokenizes the image \(x\) into visual tokens \(z\), where the vocabulary \(v\) contains discrete token indices. The discrete variational autoencoder (dVAE) consists of two modules: a tokenizer and a decoder.

    • The tokenizer maps image pixels \(x\) into discrete tokens \(z\) according to a visual codebook (vocabulary).
    • The decoder reconstructs the input image \(x\) from the visual tokens \(z\).
    • Each image has the same number of visual tokens as image patches, and the vocabulary size is \(|v| = 8192\).
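A minimal sketch of the tokenizer/decoder interface described above. The `Tokenizer` and `Decoder` modules are hypothetical stand-ins for the pre-trained dVAE (layer sizes are illustrative), included only to make the shapes concrete: one token id per patch, drawn from a codebook of size \(|v| = 8192\).

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 8192       # |v|: size of the visual codebook
GRID = 14               # 224 / 16 = 14 patches per side, 196 tokens per image

class Tokenizer(nn.Module):
    """Hypothetical stand-in: maps image pixels to one discrete token id per patch."""
    def __init__(self):
        super().__init__()
        # Downsample 224x224 -> 14x14 and score every codebook entry at each position.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(64, 256, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(256, VOCAB_SIZE, kernel_size=1),
        )

    def forward(self, x):                        # x: (B, 3, 224, 224)
        logits = self.encoder(x)                 # (B, 8192, 14, 14)
        return logits.argmax(dim=1).flatten(1)   # (B, 196) token ids

class Decoder(nn.Module):
    """Hypothetical stand-in: reconstructs the image from the visual tokens."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(VOCAB_SIZE, 256)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 64, kernel_size=4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=4),
        )

    def forward(self, tokens):                   # tokens: (B, 196)
        z = self.codebook(tokens)                # (B, 196, 256)
        z = z.transpose(1, 2).reshape(-1, 256, GRID, GRID)
        return self.decoder(z)                   # (B, 3, 224, 224)

x = torch.randn(2, 3, 224, 224)
tokens = Tokenizer()(x)                          # (2, 196), ids in [0, 8192)
recon = Decoder()(tokens)                        # (2, 3, 224, 224)
```

The `argmax` is non-differentiable and shown only for inference; training a real dVAE requires a differentiable relaxation such as Gumbel-softmax. As the figure caption notes, the tokenizer is learned before BEiT pre-training and is then only used to produce targets.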
  8. Randomly mask approximately \(40\%\) of the image patches. The final hidden vectors \(h^L_i\) serve as the encoded representations of the input patches. For each masked position, a softmax classifier predicts the corresponding visual token. The pre-training objective is to maximize the log-likelihood of the correct visual tokens given the corrupted image: \[\max\sum_{x\in\mathcal{D}}\mathbb{E}_\mathcal{M}\left[\sum_{i\in\mathcal{M}}\log p_\mathrm{MIM}(z_i|x^\mathcal{M})\right]\]
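A minimal sketch of this objective. It assumes 768-dimensional final hidden vectors \(h^L_i\) from the backbone (here faked with random tensors) and visual-token targets from the tokenizer; maximizing the log-likelihood at the masked positions is implemented as a cross-entropy loss over the 8192 candidate tokens.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, NUM_PATCHES, HIDDEN_DIM = 8192, 196, 768

# MIM head: maps each final hidden vector h^L_i to logits over the visual vocabulary.
mim_head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

hidden = torch.randn(2, NUM_PATCHES, HIDDEN_DIM)                 # h^L from the backbone (corrupted input)
visual_tokens = torch.randint(0, VOCAB_SIZE, (2, NUM_PATCHES))   # z_i from the tokenizer (original image)
mask = torch.rand(2, NUM_PATCHES) < 0.4                          # ~40% of the patches are masked

logits = mim_head(hidden)                                        # (2, 196, 8192)
# Negative log-likelihood of the correct visual tokens, taken at masked positions only.
loss = F.cross_entropy(logits[mask], visual_tokens[mask])
loss.backward()
```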

  9. Use blockwise masking rather than purely random masking. Pixel-level auto-encoding pushes the model to focus on short-range dependencies and high-frequency details; BEiT overcomes this by predicting discrete visual tokens instead.
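A sketch in the spirit of blockwise masking: rectangular blocks of patches are masked repeatedly until roughly 40% of the 14×14 grid is covered. The minimum block size and the aspect-ratio bounds below are illustrative assumptions, not necessarily the paper's exact settings.

```python
import math
import random

def blockwise_mask(grid=14, mask_ratio=0.4, min_block=16, min_aspect=0.3):
    """Mask rectangular blocks of patches until roughly mask_ratio of the grid is covered."""
    masked = set()
    target = int(mask_ratio * grid * grid)
    while len(masked) < target:
        # Sample a block area and an aspect ratio, then derive the block's height/width.
        area = random.uniform(min_block, target - len(masked) + min_block)
        aspect = math.exp(random.uniform(math.log(min_aspect), math.log(1 / min_aspect)))
        h = max(1, min(grid, round(math.sqrt(area * aspect))))
        w = max(1, min(grid, round(math.sqrt(area / aspect))))
        top, left = random.randint(0, grid - h), random.randint(0, grid - w)
        masked.update(i * grid + j for i in range(top, top + h) for j in range(left, left + w))
    return masked  # indices of masked patches in the grid

print(len(blockwise_mask()) / 196)  # roughly 0.4 (may overshoot slightly on the last block)
```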

  10. BEiT pre-training: \[\sum_{(x_i,\tilde{x}_i)\in\mathcal{D}}\left(\underbrace{\mathbb{E}_{z_i\sim q_\phi(z|x_i)}[\log p_\psi(x_i|z_i)]}_{\text{Stage 1: Visual Token Reconstruction}}+\underbrace{\log p_\theta(\hat{z}_i|\tilde{x}_i)}_{\text{Stage 2: Masked Image Modeling}}\right)\]

  11. Downstream tasks:

After pre-training BEiT, append a task layer on top of the Transformer and fine-tune the parameters on the downstream task.

  • Image classification: average-pool the patch representations, then apply a softmax classifier.
  • Semantic segmentation: follow the task layer used in SETR-PUP.
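A minimal sketch of the image-classification task layer: average-pool the per-patch representations from the pre-trained encoder, then apply a linear classifier trained with softmax cross-entropy. The encoder output is faked with random tensors, and `NUM_CLASSES` is a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN_DIM, NUM_CLASSES = 768, 1000

class ClassificationHead(nn.Module):
    """Task layer for fine-tuning: average-pool patch representations, then classify."""
    def __init__(self, hidden_dim=HIDDEN_DIM, num_classes=NUM_CLASSES):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, patch_repr):          # (B, num_patches, hidden_dim)
        pooled = patch_repr.mean(dim=1)     # aggregate the patch representations
        return self.fc(pooled)              # logits; softmax is applied inside the loss

# Placeholder for the pre-trained BEiT encoder output: per-patch representations.
patch_repr = torch.randn(8, 196, HIDDEN_DIM)
labels = torch.randint(0, NUM_CLASSES, (8,))

head = ClassificationHead()
loss = F.cross_entropy(head(patch_repr), labels)  # softmax cross-entropy for the task layer
loss.backward()
```

In actual fine-tuning the pre-trained encoder is updated together with this head; here only the head is instantiated to keep the sketch self-contained.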

  12. The self-attention mechanism in BEiT can separate objects.

BEiT Figure 1[1]: Overview of BEIT pre-training. Before pre-training, we learn an "image tokenizer" via autoencoding-style reconstruction, where an image is tokenized into discrete visual tokens according to the learned vocabulary. During pre-training, each image has two views, i.e., image patches, and visual tokens. We randomly mask some proportion of image patches (gray patches in the figure) and replace them with a special mask embedding [M]. Then the patches are fed to a backbone vision Transformer. The pre-training task aims at predicting the visual tokens of the original image based on the encoding vectors of the corrupted image.