DINO

Emerging Properties in Self-Supervised Vision Transformers[1]

The authors are a team from FAIR, Inria, and Sorbonne University. Reference [1]: Caron, Mathilde et al. “Emerging Properties in Self-Supervised Vision Transformers.” 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 9630-9640.

Time

  • 2021.Apr

Motivation

  1. Is the success of Transformers in vision due to the supervision used during pretraining? A major factor behind the success of Transformers in NLP is self-supervised pretraining.
  2. The authors therefore study self-supervised pretraining of ViT features.

Key Words

  • Self-supervised ViT features
  • self-distillation with no labels (DINO)

Summary

  1. Properties of self-supervised pretraining on ViT that do not appear with supervised ViTs:
    • The features explicitly contain the scene layout and, in particular, object boundaries; this information is mainly found in the self-attention modules of the last block.
    • Self-supervised ViT features reach 78.3% top-1 accuracy on ImageNet with a basic k-NN classifier, without any finetuning, linear classifier, or data augmentation; a sketch of such a k-NN evaluation follows this list.
  2. This strong k-NN performance only emerges when the method is combined with a momentum encoder and multi-crop augmentation. Using smaller patches with ViTs also improves the quality of the resulting features.
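
The k-NN evaluation mentioned above is a simple weighted nearest-neighbour vote over frozen features. Below is a minimal sketch of such a protocol, assuming PyTorch; the function name, the cosine-similarity weighting, and the default k and temperature are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_classify(train_feats, train_labels, test_feats, num_classes, k=20, temperature=0.07):
    """Weighted k-NN on frozen, pre-extracted features.

    train_feats: (N, D), test_feats: (M, D), train_labels: (N,) long tensor.
    Features are L2-normalized so the dot product is cosine similarity.
    """
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)

    sims = test_feats @ train_feats.t()                # (M, N) cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)          # k nearest training samples per test sample
    topk_labels = train_labels[topk_idx]               # (M, k) labels of those neighbours

    weights = (topk_sims / temperature).exp()          # similarity-weighted votes
    votes = torch.zeros(test_feats.size(0), num_classes, device=test_feats.device)
    votes.scatter_add_(1, topk_labels, weights)        # accumulate weighted votes per class
    return votes.argmax(dim=1)                         # predicted class per test sample
```

In practice `train_feats` and `test_feats` would be extracted once from the frozen self-supervised ViT (e.g. its [CLS] token output) over the training and validation sets.
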
  1. The proposed method, DINO, is a simple self-supervised approach: the student directly predicts the output of a teacher network, built with a momentum encoder, using a standard cross-entropy loss, which simplifies self-supervised training. The framework is also flexible: it works on both convnets and ViTs without modifying the architecture or changing internal normalizations.

  2. This combines self-supervision with knowledge distillation: there is a teacher network and a student network, somewhat like the two branches of a Siamese network. I am still not very familiar with distillation and need to read more about it.

  3. Self-training: the goal is to propagate a small initial set of annotations to many unlabeled instances. This propagation can be done through hard or soft assignments of labels; when soft labels are used, the method is usually called knowledge distillation, originally designed to train a small network to mimic the output of a larger one and thereby compress the model. The authors' work is related to codistillation, where the student and teacher networks share the same architecture and distillation is applied during training.

  4. For a given image, a set V of different views is generated, containing two global views (x1, x2) and several local views at smaller resolution. All crops are passed through the student, while only the global views are passed through the teacher, encouraging local-to-global correspondences by minimizing the loss:

    \[\min_{\theta_s} H(P_t(x), P_s(x)), \quad \text{where } H(a,b) = -a\log b.\]

    Specifically:

    \[\min_{\theta_s}\sum_{x\in\{x_1^g,\,x_2^g\}}\;\sum_{\substack{x'\in V\\ x'\neq x}} H\bigl(P_t(x), P_s(x')\bigr).\]

This setup is the most basic parameterization of DINO.
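
A minimal sketch of this objective, loosely in the spirit of the pseudocode in the paper: the teacher output is centered and sharpened with a low temperature before the cross-entropy against the student, gradients flow only through the student (the teacher output is detached), and the teacher weights and the center are updated as exponential moving averages. The function names, the default values, and the assumption that the first two entries of `student_out` correspond to the two global views are mine, not the paper's.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy H(P_t, P_s) between teacher and student distributions.

    student_out: list of (B, K) logits, one per view (global crops first, then local crops).
    teacher_out: list of (B, K) logits, one per *global* view only.
    center and tau_t implement the centering and sharpening of the teacher output.
    """
    student_logp = [F.log_softmax(s / tau_s, dim=1) for s in student_out]
    teacher_p = [F.softmax((t - center) / tau_t, dim=1).detach() for t in teacher_out]  # stop-gradient

    loss, n_terms = 0.0, 0
    for t_idx, p_t in enumerate(teacher_p):
        for s_idx, logp_s in enumerate(student_logp):
            if s_idx == t_idx:                 # skip pairs where student and teacher see the same view
                continue
            loss += -(p_t * logp_s).sum(dim=1).mean()
            n_terms += 1
    return loss / n_terms

@torch.no_grad()
def update_teacher_and_center(student, teacher, center, teacher_out, ema=0.996, center_m=0.9):
    """EMA update of the teacher parameters and of the center applied to teacher outputs."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(ema).add_(p_s, alpha=1.0 - ema)
    batch_mean = torch.cat(teacher_out).mean(dim=0, keepdim=True)
    return center * center_m + batch_mean * (1.0 - center_m)
```

Skipping the pairs where student and teacher see the same view is what turns the loss into the local-to-global (and cross-view global-to-global) correspondences described above.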

  1. Avoiding collapse: this part is not yet fully clear to me. Self-supervised methods avoid collapse through different mechanisms, e.g. contrastive losses, clustering constraints, predictors, or batch normalizations. Here the authors use centering and sharpening of the teacher output to avoid it (both appear in the training sketch above).

  2. For self-supervised learning, the standard evaluation protocols are either training a linear classifier on frozen features or finetuning the features on downstream tasks; a brief linear-probe sketch follows.
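
As a companion to the k-NN sketch earlier, a hedged sketch of the linear-probe protocol follows: the backbone stays frozen and only a linear head is trained on its features. Here `backbone`, `feat_dim`, `num_classes`, `train_loader`, and the optimizer settings are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, train_loader, epochs=1):
    """Train a linear classifier on top of frozen features (standard evaluation protocol)."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)                 # keep the backbone frozen

    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(head.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = backbone(images)        # frozen features, no gradients into the backbone
            loss = criterion(head(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head
```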

Conclusion

  3. Two properties:

    • The quality of the features for k-NN classification shows potential for image retrieval.
    • The scene layout present in the features can benefit weakly supervised image segmentation.
  4. Result: there is evidence that self-supervised learning could be a key ingredient for developing BERT-like models based on ViT.

DINO

\(Fig.1^{[1]}\) Self-distillation with no labels. We illustrate DINO in the case of one single pair of views (x1, x2) for simplicity. The model passes two different random transformations of an input image to the student and teacher networks. Both networks have the same architecture but different parameters. The output of the teacher network is centered with a mean computed over the batch. Each network outputs a K-dimensional feature that is normalized with a temperature softmax over the feature dimension. Their similarity is then measured with a cross-entropy loss. We apply a stop-gradient (sg) operator on the teacher to propagate gradients only through the student. The teacher parameters are updated with an exponential moving average (ema) of the student parameters.
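
The caption above shows only two views for simplicity; below is a rough sketch of the multi-crop view generation the figure abstracts away (two global crops plus several smaller local crops), assuming torchvision. The crop sizes and scale ranges follow commonly used DINO settings, and the color jittering and blurring of the full augmentation pipeline are omitted, so treat the exact numbers as assumptions.

```python
from torchvision import transforms

# Two global crops cover large parts of the image; local crops cover small regions.
global_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.4, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
local_crop = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.4)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def multi_crop(image, n_local=6):
    """Return 2 global views plus n_local local views of the same PIL image."""
    return [global_crop(image) for _ in range(2)] + [local_crop(image) for _ in range(n_local)]
```

All of these crops would be passed through the student, while only the two global crops go to the teacher, as described in the summary above.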