Self-Guided Masked Autoencoder

Posted on 2025-05-12 | Category: Papers | Word count: 2.5k | Reading time ≈ 9 min

Self-Guided Masked Autoencoder^[1]

The authors are Jeongwoo Shin et al. from Google and Seoul National University. Citation [1]: Shin, Jeongwoo et al. “Self-Guided Masked Autoencoder.” Neural Information Processing Systems (2024).

Time

Key Words

Masked Autoencoder

Summary

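MAE is a self-supervised approach to representation learning that is widely used for downstream tasks in CV. Despite its success, what it learns and how it learns it have not been fully uncovered. In this paper, the authors carry out an in-depth analysis and find that MAE learns pattern-based patch-level clustering from the early stages of pretraining. Building on this understanding, they propose the self-guided masked autoencoder, which uses the progress in patch clustering to internally generate informed masks that replace the random masking of the vanilla MAE. The method does not rely on any external models or supplementary information, significantly boosts the learning process, and fully preserves the advantage of MAE's self-supervised nature.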

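Self-supervised learning is an attractive direction for alleviating the high cost of data annotation. One example is Masked Language Modeling (MLM), which predicts the masked words of an input sentence and is used by BERT and GPT to capture the contextual meaning of a word. Inspired by MLM, Masked Image Modeling (MIM) has been introduced to CV to exploit large amounts of unlabeled data. MAE uses an asymmetric encoder-decoder architecture built on ViT and shows that simply reconstructing the RGB pixels of masked patches is enough to achieve competitive performance on a range of downstream tasks. Following MAE's impressive performance, a line of work has emerged that strengthens its capabilities with informed masking techniques. These methods draw on various sources of extra information, including attention maps generated by a supervised ViT, knowledge learned by a pretrained self-supervised model, or supplementary adversarial modules, all aimed at improving mask quality. However, these popular methods simply apply informed masking without fully understanding how MAE works.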

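To this end, the authors run extensive experiments and in-depth analyses to understand the internal workings of MAE, since what MAE learns and how it learns it remain unclear despite some prior efforts. Based on this analysis, they explore MAE's potential to produce its own informed masks. They first show that MAE inherently learns pattern-based patch-level clustering and that this property emerges from the extremely early stages of pretraining. They then unveil the underlying mechanism of the mask tokens in the decoder. Building on this understanding, they propose a new method that improves the training process through informed masks generated in a fully unsupervised manner, with no extra models or supplementary information. Unlike previous informed masking methods and unlike random masking, the informed masks produced by this method cover the main objects entirely, using the distinguishable patch representations available from the early stages of training. With these internally generated informed masks, MAE learns patch-level clustering faster, which leads to a clearer and finer embedding space. The contributions are:
- The authors find that MAE learns pattern-based patch-level clustering within each image, and that this emerges from the early stages of pretraining.
- The authors propose a new masking method, the self-guided masked autoencoder, which relies solely on an internal quantification of the progress in patch clustering.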
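Hierarchical Latent Variable Model: Recent work found that the internal operation of MAE can be explained under the framework of a hierarchical latent variable model. An input image contains high-level shared information $c$, which is equivalent to the statistical dependency among patches. The MAE encoder $E(X_v)$ learns high-level latent variables by estimating the shared information $\hat{c}$ from the visible patches $X_v$, and the decoder $D([E(X_v);m])$ performs the reconstruction task by deriving $X_m$ from $\hat{c}$ via the mask tokens.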
The authors study how MAE learns the concept of token relation and show that it learns pattern-based patch-level clustering from the early stages of training. They also explain the underlying mechanism of the MAE decoder.
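- Token Relation: The authors analyze token embeddings and their quantified pairwise relationships, namely the attention score matrix A and the cosine similarity matrix M. A and M are computed from the input patches and the transformer weights for queries, keys, and values. First, the authors feed the complete set of patches $X$ to the encoder, denoted $E(X)$. This ideal setting provides the most accurate $\hat{c}$ and is therefore well suited to analyzing the features MAE has learned. Second, token relations can also be obtained from the decoder in the practical setting, which includes mask tokens. Because only visible tokens are used to estimate $\hat{c}$, this setting yields less accurate token relations than the former.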
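- What MAE learns: The authors examine the distribution of patch relationships in the learned embedding space, using the last-layer embeddings of $E(X)$ and $D([E(X_v);m])$ for patches of set-aside test images. In the figure, they compare the patch representations of different models, which shows that the MAE encoder learns patch clustering based on visual patterns such as texture and color. In summary: MAE learns patch-level clustering within an image based on visual patterns. With only the visible tokens available to the encoder, the decoder shows a similar trend, though less clear than the encoder's. The comparison mainly rests on two metrics: the feature variance $\sigma_F$ and the variance of the pairwise similarities $\sigma_S$.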
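- When MAE learns patch clustering: The authors answer this by tracking MAE's token relations. They start from the simplest form of token clustering, bi-partitioning: applying graph-cut to M clusters the patches into the two most prominent sub-groups. Based on this clustering, they trace the mean of the inter-cluster edge weights, $\mu_{inter}$, and the mean of the intra-cluster edge weights, $\mu_{intra}$. The results show two notable patterns in the gap ($\mu_{intra} - \mu_{inter}$): the gap tends to grow with training steps, more markedly for attention scores; and $\mu_{intra}$ and $\mu_{inter}$ are separated by a clear margin from a very early stage. The decoder shows a similar but less prominent trend.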
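The authors also directly track the gap between token relations. Specifically, they consider the KL divergence for the $i$-th layer at the $j$-th epoch, where $N$ is the total number of epochs. The results indicate that MAE learns token relations at an early stage and gradually strengthens them over the rest of training. The decoder shows a similar trend.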
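As a rough illustration of the quantities tracked above, the NumPy sketch below computes an attention score matrix, a cosine similarity matrix, the gap $\mu_{intra} - \mu_{inter}$ for a given two-cluster labelling, and a row-wise KL divergence between two attention matrices. The function names, the single-head attention, the exclusion of self-pairs from the intra-cluster mean, and the direction of the KL divergence are all assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

def attention_scores(x, w_q, w_k):
    """Row-stochastic attention matrix softmax(Q K^T / sqrt(d)) for one head."""
    q, k = x @ w_q, x @ w_k
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def cosine_similarity_matrix(v):
    """Entry (i, j) is the cosine similarity between rows i and j of v."""
    unit = v / np.linalg.norm(v, axis=-1, keepdims=True)
    return unit @ unit.T

def intra_inter_gap(rel, labels):
    """mu_intra - mu_inter of a relation matrix for a two-cluster labelling."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)   # assumption: ignore self-pairs
    return rel[same & off_diag].mean() - rel[~same].mean()

def mean_row_kl(p, q, eps=1e-12):
    """Average KL(p_i || q_i) over the rows of two row-stochastic matrices."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))
```

Under these assumptions, calling `intra_inter_gap` on the relation matrix of each checkpoint, or `mean_row_kl` between a checkpoint and the final epoch, would produce the kind of curves the analysis describes.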
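- In the preceding experiments, the authors observe that the decoder is able to build complete token relations, which verifies that the decoder uses the shared information $\hat{c}$ from the encoder to fill in the missing information of the masked-out tokens and reconstruct them. Connecting this with the earlier findings, the authors claim that the pattern-based patch clustering learned by MAE corresponds to this $\hat{c}$. If the encoder is sufficiently trained, its output embeddings for the visible tokens convey the general context of the whole image. Through the decoding process, the mask tokens are then contextualized by selectively attending to $X'_v$, and so come to hold the essential information for representing the target patches $X_m$. Reversing this reasoning, the authors assess whether the encoder has been trained well enough to relate patches precisely by quantifying $\hat{c}$. Based on this idea, they propose a new metric for evaluating this during training.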
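The authors define the exploitation rate of mask tokens over decoder layers using the attention score matrix A; it can be interpreted as a special case of attention rollout. For sets of token indices $\mathcal{A}$ and $\mathcal{B}$, the exploitation rate measures how much the tokens in $\mathcal{A}$ are used to reconstruct the tokens in $\mathcal{B}$, and is defined as the average attention weight.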

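The authors measure the exploitation rates of visible tokens and mask tokens in each decoder layer. Heavy exploitation of mask tokens indicates that they carry a substantial amount of the shared information estimated by the encoder, making them more valuable than a simple interpolation of visible patches. In summary: once the encoder is trained well enough to cluster patches, its outputs reflect the shared information, and these outputs are used to constitute the mask tokens in the decoder. The mask tokens therefore hold this patch clustering information and are heavily used to reconstruct the masked-out patches. Conversely, from a high exploitation rate of mask tokens in the decoder, the authors infer that the mask tokens hold patch clustering information conveyed from the encoder that is sufficient to cluster patches. This reasoning is verified by measuring the shared information learned by the encoder. Heavy exploitation of mask tokens in the decoder thus signals that the encoder is trained well enough to cluster patches.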
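To make the exploitation rate more concrete, here is a minimal single-layer sketch. It assumes the rate is the attention mass that query tokens in one index set place on key tokens in another, averaged over the queries. The function name `exploitation_rate`, the row-as-query convention, and this particular normalization are assumptions for illustration; the paper's exact definition, including how layers are accumulated, may differ.

```python
import numpy as np

def exploitation_rate(attn, key_idx, query_idx):
    """Average attention mass that queries in query_idx place on keys in key_idx.

    attn is an (n, n) row-stochastic attention matrix of one decoder layer,
    with rows indexing queries and columns indexing keys (assumed layout).
    """
    block = attn[np.ix_(query_idx, key_idx)]
    return float(block.sum(axis=-1).mean())

# Hypothetical usage: compare how much mask tokens and visible tokens
# contribute when reconstructing the masked positions.
attn = np.full((4, 4), 0.25)          # toy uniform attention
visible, masked = [0, 1], [2, 3]
r_visible = exploitation_rate(attn, visible, masked)
r_mask = exploitation_rate(attn, masked, masked)
```

With this normalization the two rates for the same query set sum to one, so a value of `r_mask` well above `r_visible` would correspond to the "heavy exploitation of mask tokens" described above.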
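Self-Guided Informed Masking: The authors have shown that MAE learns patch clustering from an early stage, which makes it possible to split an image into two major token clusters and mask out one of them. In other words, MAE itself can generate informed masks at an early stage of pretraining, and these masks can then be used for the remainder of training. To decide when MAE can properly cluster patches, the exploitation rate introduced above is used, so that informed masks can be generated at epoch $T$. This is the method the authors design.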

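Instead of relying on random masking, the authors use the patch relevance learned in the early stage to accelerate training. Random masking delays the learning of powerful patch clustering, because the model inefficiently revisits patches that were already clustered in the early stage; these patches reflect the key dissimilarities among image tokens.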

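Based on this idea, the authors propose self-guided informed masking, which masks out one of the top two well-separated clusters and thereby injects information about the learned key dissimilarities. Informed masks are generated starting at epoch $T$, and training then continues.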

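With this method, MAE training is accelerated by focusing on learning the less distinguishable patches instead of wasting time rediscovering the most prominent patterns. The method relies only on internal metrics and needs no external models or extra information at all.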

Implementing this requires: 1. bi-partitioning the image; 2. designing the informed masks; 3. choosing the attention layer from which the informed masks are built; 4. deciding when to start informed masking.
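- Bi-partitioning: To bi-partition the image, the authors use Normalized Cut, which accounts for both the dissimilarity between different clusters and the similarity within each cluster. They build a fully connected undirected image graph with patches as nodes and similarities as edge weights. Splitting all node indices into two disjoint sets amounts to minimizing the Ncut energy, and the problem is solved approximately by computing the eigenvector associated with the second smallest eigenvalue.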
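- The authors restrict masking to object-centric regions. By narrowing the masking focus, the method guides MAE to concentrate on learning patch clustering within object regions, so that the loss only affects object-related parts and patch clustering within the object region is learned faster. In this context, the authors aim to mask out the cluster that contains the main object so that the model learns feature representations faster. Since no labels are available, they adopt a simple rule: the cluster containing the element with the largest absolute value in $y_1$ is taken to be the main object.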
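In practice, several issues, such as imperfect bi-partitions and clusters of different sizes, complicate batch processing. The authors therefore rank tokens by their relevance to the cluster $C$ and mask out a fixed ratio of tokens according to this ranking.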
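- Appropriate Layer for Patch Clustering: The authors use attention distance and Normalized Mutual Information (NMI) to decide which layer's embeddings should be used to compute the similarity matrix between patches. To obtain sufficiently meaningful token relations, they discard the early layers and choose the second-to-last encoder layer.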
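The masking steps above can be sketched roughly as follows, assuming a symmetric patch similarity matrix with positive row sums is already available (for instance from the chosen encoder layer). The sketch relaxes Normalized Cut to an eigenproblem on the symmetric normalized Laplacian, takes the side of the split that contains the largest absolute entry of the eigenvector as the object cluster, and masks a fixed ratio of tokens ranked by relevance to that cluster. The function names, the sign-based split, clipping negative similarities to zero, and the mean-similarity relevance score are assumptions for this sketch rather than the authors' exact procedure.

```python
import numpy as np

def ncut_eigenvector(sim):
    """Second-smallest eigenvector of the relaxed Normalized Cut problem."""
    w = np.clip(sim, 0.0, None)              # edge weights must be non-negative
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=-1))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    lap = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)            # eigenvalues are returned in ascending order
    return d_inv_sqrt * vecs[:, 1]           # undo the D^{1/2} change of variables

def self_guided_mask(sim, mask_ratio=0.75):
    """Boolean mask over tokens plus the membership of the assumed object cluster."""
    y1 = ncut_eigenvector(sim)
    side = y1 > 0                            # assumed split: sign of the eigenvector
    anchor = int(np.argmax(np.abs(y1)))      # largest |y1| marks the object cluster
    in_object = side == side[anchor]
    relevance = sim[:, in_object].mean(axis=-1)   # assumed relevance to the cluster
    n_mask = int(round(mask_ratio * len(sim)))
    mask = np.zeros(len(sim), dtype=bool)
    mask[np.argsort(-relevance)[:n_mask]] = True  # fixed number of masked tokens
    return mask, in_object
```

Ranking and then masking a fixed number of tokens keeps the mask size identical for every image, which is what makes batch processing straightforward even when the two clusters have very different sizes.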

Illustration of our self-guided MAE (Fig. 1 in [1]).
