MoVE-KD

Posted on 2025-04-14 · Category: Papers · Word count: 3k · Reading time ≈ 11 min

MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders^[1]

The authors are Jiajun Cao et al. from Peking University and other institutions. Citation [1]: Cao, Jiajun et al. “MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders.” ArXiv abs/2501.01709 (2025): n. pag.

Time

2025.Mar

Key Words

Single Vision encoder
LoRA
MoE

Summary

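Visual encoders are fundamental components of VLMs. Each is derived from a pretrained visual foundation model and exhibits unique strengths. To exploit the varied capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, which considerably increases computational cost. The authors propose Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, they employ LoRA and mixture-of-experts (MoE) to selectively activate specialized knowledge based on input features, improving both adaptability and efficiency. To regularize the KD process and enhance performance, they propose an attention-based distillation strategy that adaptively weights the different encoders and emphasizes valuable visual tokens, easing the burden of replicating comprehensive but distinct features from multiple teachers.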

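The rapid progress of VLMs has driven AI research forward, particularly on tasks that require combining visual and linguistic understanding. The vision encoder is at the core of these models: it is essential for visual perception and forms the basis for interpreting visual inputs and carrying out vision-language tasks effectively. Recent studies highlight the distinct strengths of different vision encoders such as CLIP, EVA, and ConvNeXt, each of which performs well in particular vision-language applications. This diversity makes optimizing and integrating these models a key area of research. To exploit the proficiency of multiple vision encoders, current methods typically place several encoders in one vision-language model and fuse them through feature concatenation or attention mechanisms. Compared with single-encoder VLMs, however, multiple encoders inevitably increase computational cost and model complexity, reducing efficiency and scalability. To address this, the authors ask whether the unique proficiencies of multiple encoders can be distilled into a single efficient encoder model, obtaining their collective advantages while improving efficiency.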

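Knowledge distillation (KD) offers a promising way to unify multiple encoders into one, since it effectively transfers knowledge from a teacher model to a student model. Classic KD methods, however, focus on one-to-one distillation; distilling simultaneously from multiple models, each with distinct pretraining datasets and objectives, remains relatively under-explored. AM-RADIO proposes using multiple heads on a single model to reproduce the predictions of multiple vision foundation models, but its performance is limited by the conflict of learning diverse, competing characteristics in a shared backbone.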

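The authors' method fine-tunes a base model via knowledge distillation from multiple pretrained visual foundation models, using a mixture-of-LoRA-experts (MoLE) framework. In this framework, the base model is adapted through multiple low-rank adaptation (LoRA) experts, which mitigates catastrophic forgetting, and the experts are selectively activated based on the characteristics of the input. This design lets the model dynamically draw on the strengths and specialized insights of each teacher encoder, yielding a coherent and efficient single-encoder structure.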

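Beyond student-side selective adaptation, identifying and refining valuable features from the teacher models is also important. The authors propose using an attention mechanism to guide knowledge distillation from the teachers. Specifically, they use the [CLS] token to obtain the importance of each visual token and apply a weighted distillation loss that prioritizes valuable tokens from the teachers. In addition, they use the average importance of the visual tokens as a weighting factor to balance the contributions of multiple teachers; in other words, teachers with higher average importance for a given sample receive more emphasis. This selective distillation ensures that only valuable information from the teachers is absorbed by the student, which improves the ability to compress knowledge from multiple teachers into a single model.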

The authors' method, Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), integrates the strengths of multiple encoders while retaining the efficiency of a single model.

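The contributions are as follows:
- The MoVE-KD framework for multi-vision-encoder fusion, which the authors describe as the first to integrate different encoders for large vision-language models from the perspective of knowledge distillation.
- Attention-guided KD regularization, which strengthens the distillation of critical visual tokens and assigns an adaptive weight to each teacher, together with mixture-of-LoRA-experts (MoLE), which prevents knowledge confusion.
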
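Recent VLMs leverage the generalization ability of LLMs to improve multimodal understanding and generalization. Building on CLIP's image encoder, which is already aligned with the language modality to some extent, current VLMs typically use large numbers of image-text pairs to connect the vision encoder to the LLM so that the LLM can receive and understand visual content. For example, Flamingo integrates visual features into the LLM through gated attention, and LLaVA connects the vision encoder to the LLM directly with MLPs, demonstrating multimodal dialogue ability. Other recent work improves the vision encoder's representations to further strengthen VLM perception: Mini-Gemini adopts an additional vision encoder for high-resolution refinement, and \(S^2\) introduces multiple visual branches by scaling up the image scale. These methods, however, usually rely on higher-resolution image inputs or specially designed extra modules, which require more computational resources. In this work, the authors improve the vision modality of VLMs through adaptive supervision from a mixture of visual experts, for which higher-quality training data is optional.
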
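Knowledge distillation transfers knowledge from a teacher model to a student model, improving the student without additional parameters. As vision-language models become popular, how to enhance VLMs through KD is a notable research direction.
- KD on the vision encoder: in vision-language models, the vision encoder is essential for extracting high-level features from visual signals. Previously, DINOv2 used self-distillation to train smaller variants from a larger teacher, and other work distilled models from a CLIP teacher. Because a single teacher has limited ability, this cannot bring out the full potential of the student. In contrast, AM-RADIO began to use multiple vision experts to distill a vision encoder that replaces the original one in VLMs. These methods, however, are standalone and not integrated into the VLM framework. The authors describe their work as the first to combine distinct encoders for VLMs in a unified way through knowledge distillation, which benefits the alignment of the vision modality.
- KD with multiple teachers: conventional deep learning methods distill a student model under the supervision of multiple teachers with a triplet loss. More recently, OneS was the first to bring multiple teachers into knowledge distillation for LLMs, gathering key knowledge from several different pretrained experts. To the authors' knowledge, they are the first to use multiple vision teachers to enhance VLMs through knowledge distillation.
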
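The authors propose MoVE-KD, a new knowledge distillation method that uses multiple visual encoders for vision-language models. First, encoder adapters project the outputs of the multiple teacher encoders into a unified representation space. Based on the [CLS] attention from a pretrained CLIP model, weights are dynamically assigned to the teacher encoders and to the visual tokens, and the KD loss is then computed as a sum weighted by the teacher weights and token weights. To mitigate potential conflicts when learning from multiple sources of knowledge, the authors introduce a mixture-of-LoRA-experts structure. The objective is to minimize the text loss and the KD loss.
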
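Encoder adapter: a visual input is processed by the different teacher visual encoders to obtain visual tokens. Because encoders from different sources have inconsistent representation spaces, these tokens cannot be aligned directly with the student's visual tokens. Moreover, the linear interpolation commonly used in KD methods struggles to map substantially different tokens from different encoders into a unified, student-friendly token space. To match dimensions and align the token spaces, the authors therefore introduce an encoder adapter for each teacher encoder. Each adapter is a two-layer MLP tailored to the output of its teacher, and it is used and optimized independently with the knowledge distillation loss.
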
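The per-teacher adapter described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the class name `EncoderAdapter`, the hidden width, the ReLU activation, the random initialization, and all dimensions are assumptions chosen only to show how teachers with different token widths end up in one student-sized space.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weight, bias) for a dense layer with small random weights."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, layer):
    """Apply a dense layer to a single token vector."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class EncoderAdapter:
    """Hypothetical two-layer MLP mapping one teacher's token width to the student's."""

    def __init__(self, teacher_dim, hidden_dim, student_dim, seed=0):
        rng = random.Random(seed)
        self.fc1 = make_linear(teacher_dim, hidden_dim, rng)
        self.fc2 = make_linear(hidden_dim, student_dim, rng)

    def __call__(self, tokens):
        # tokens: list of n token vectors, each of length teacher_dim
        # (the activation is an assumption; the post does not name one)
        return [linear(relu(linear(tok, self.fc1)), self.fc2) for tok in tokens]


# One independent adapter per teacher; names and dimensions are illustrative only.
adapters = {
    "teacher_a": EncoderAdapter(teacher_dim=6, hidden_dim=8, student_dim=4, seed=1),
    "teacher_b": EncoderAdapter(teacher_dim=5, hidden_dim=8, student_dim=4, seed=2),
}
teacher_tokens = {
    "teacher_a": [[0.1] * 6 for _ in range(3)],
    "teacher_b": [[0.2] * 5 for _ in range(3)],
}
aligned = {name: adapters[name](toks) for name, toks in teacher_tokens.items()}
```

After this step every teacher contributes the same number of tokens with the student's width, so a token-wise loss against the student becomes well defined.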
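Mixture-of-LoRA-experts: after token alignment, the student encoder, initialized from the pretrained CLIP visual encoder, is fine-tuned to learn the teacher tokens. The authors find, however, that fine-tuning the student encoder directly is problematic: direct fine-tuning on the target dataset leads to overfitting and catastrophic forgetting, which hurts the model's accuracy and generalization. It is also hard for one set of shared weights to achieve a unified representation that retains all the teachers' strengths while resolving conflicts among teacher tokens. To address this, the authors introduce mixture-of-LoRA-experts, an architecture with two components: mixture-of-experts (MoE) and LoRA experts. First, following the typical MoE design, specific experts are activated based on the input. For the FFN \(F\) of each layer in the student encoder and input features \(x\), the MoE output is: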

\[F^*(x) = F(x) + E_i(x), \quad \text{with } i = \operatorname*{arg\,max}\big(\text{Softmax}(f(x))\big)\]

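The router \(f\) is a linear layer that learns the weight of each expert, \(E_i\) is the selected i-th expert, and \(F(x)\) is the output of the original FFN. By activating the relevant experts, this approach is particularly effective for KD across multiple visual tasks and improves the model's ability to adapt to visual knowledge from diverse domains.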

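Previous MoE designs, however, replicate the FFN to serve as independent experts, which greatly increases the parameter count. The authors' method therefore uses parameter-efficient LoRA modules as experts. By replacing one large parameter matrix with two low-rank matrices, LoRA significantly reduces the number of trainable parameters while maintaining model performance. LoRA also shows better generalization and transferability, which is useful when the encoder is fine-tuned with limited data.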

MoLE accelerates the distillation process, lets the model better capture the strengths of each teacher, and avoids conflicts between sources of knowledge, with only a small parameter overhead.
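The routing rule \(F^*(x) = F(x) + E_i(x)\) with LoRA experts can be sketched as follows. This is a toy illustration under stated assumptions rather than the paper's code: `MoLELayer`, `LoRAExpert`, and `frozen_ffn` are hypothetical names, the FFN is a stand-in function, the dimensions are arbitrary, and the up-projection is initialized randomly (standard LoRA starts it at zero) so that the expert's contribution is visible.

```python
import math
import random


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class LoRAExpert:
    """Low-rank expert E(x) = B(Ax), with A of shape rank x dim and B of shape dim x rank."""

    def __init__(self, dim, rank, rng):
        self.down = random_matrix(rank, dim, rng)  # A
        # Assumption for the demo: random B instead of the usual zero init.
        self.up = random_matrix(dim, rank, rng)    # B

    def __call__(self, x):
        return matvec(self.up, matvec(self.down, x))


class MoLELayer:
    """Original FFN output plus the single LoRA expert picked by a linear router."""

    def __init__(self, ffn, dim, num_experts, rank, seed=0):
        rng = random.Random(seed)
        self.ffn = ffn                                      # F, the original FFN
        self.router = random_matrix(num_experts, dim, rng)  # f, a linear layer
        self.experts = [LoRAExpert(dim, rank, rng) for _ in range(num_experts)]

    def route(self, x):
        # i = argmax(Softmax(f(x)))
        probs = softmax(matvec(self.router, x))
        return max(range(len(probs)), key=probs.__getitem__)

    def __call__(self, x):
        expert = self.experts[self.route(x)]
        # F*(x) = F(x) + E_i(x)
        return [f + e for f, e in zip(self.ffn(x), expert(x))]


def frozen_ffn(x):
    """Stand-in for the student's original FFN (hypothetical)."""
    return [2.0 * v for v in x]


dim, rank = 8, 2
layer = MoLELayer(frozen_ffn, dim=dim, num_experts=3, rank=rank, seed=0)
x = [0.1 * k for k in range(dim)]
y = layer(x)

lora_params = 2 * dim * rank  # weights in one LoRA expert
dense_params = dim * dim      # weights in a single dense dim x dim matrix
```

Even at this toy size one LoRA expert holds half the weights of a single dense `dim x dim` matrix, and the gap widens as `dim` grows while `rank` stays small.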
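- Attention-guided KD regularization: the key to distilling knowledge from multiple teachers into one model is to guide the student toward the features it should focus on. Different visual encoders understand the same image differently, and some representations are useless or redundant; attending to them weakens the learning of the truly important and unique features. An appropriate constraint is therefore needed to regularize the distillation.

Under such a constraint, the student should be guided by a distillation loss that distinguishes valuable from redundant regions in the teacher tokens at both the fine-grained token level and the coarse-grained teacher level. An ideal distillation loss on visual tokens thus takes the form: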

\[\mathcal{L}_{kd} = \sum_{i=1}^{m} W_i^{(\text{tea})} \sum_{j=1}^{n} \left( W_j^{(\text{tok})} + \frac{1}{n} \right) \text{MSE}(V_{i,j}^{(t)}, V_j^{(s)}),\]

Here \(m\) is the number of teacher encoders, \(n\) is the sequence length of the visual tokens, \(V^{(t)}\) and \(V^{(s)}\) are the teacher and student visual tokens, and \(W^{(\text{tok})}\) and \(W^{(\text{tea})}\) are the token-level and teacher-level weight vectors.
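To make the double sum concrete, here is a small sketch that evaluates \(\mathcal{L}_{kd}\) for given weights. The function names and the toy numbers are illustrative assumptions; the teacher tokens are assumed to be already projected to the student's width, and the weights are passed in rather than computed.

```python
def mse(a, b):
    """Mean squared error between two token vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kd_loss(teacher_tokens, student_tokens, teacher_weights, token_weights):
    """Weighted distillation loss.

    teacher_tokens[i][j] is token j of teacher i, student_tokens[j] is the
    student's token j, teacher_weights[i] is W_i^(tea), token_weights[j] is W_j^(tok).
    """
    n = len(student_tokens)
    total = 0.0
    for w_tea, tokens in zip(teacher_weights, teacher_tokens):
        per_teacher = sum(
            (token_weights[j] + 1.0 / n) * mse(tokens[j], student_tokens[j])
            for j in range(n)
        )
        total += w_tea * per_teacher
    return total


# Toy case with m = 2 teachers and n = 2 tokens of width 2.
student = [[0.0, 0.0], [1.0, 1.0]]
teachers = [
    [[1.0, 1.0], [1.0, 1.0]],  # per-token MSE against the student: 1 and 0
    [[0.0, 0.0], [3.0, 3.0]],  # per-token MSE against the student: 0 and 4
]
loss = kd_loss(teachers, student, teacher_weights=[0.25, 0.75], token_weights=[0.5, 0.5])
# Each token factor is 0.5 + 1/2 = 1, so loss = 0.25 * 1 + 0.75 * 4 = 3.25
```

The \(\frac{1}{n}\) term acts as a uniform floor: a token whose attention weight is close to zero still contributes to the loss.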

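Given this motivation, a good visual encoder should have strong perception and the ability to focus on key information in the image. Instead of using the common learnable tokens to capture the weights, the authors take a more efficient and general approach based on the \([CLS]\) token in CLIP. The cross-attention between the \([CLS]\) token and the other visual tokens indicates the key regions of the image and shows less interest in repeated and unimportant information. This focusing characteristic, like the influential regions themselves, is important for the student. The authors therefore design the KD regularization using weights provided by CLIP's \([CLS]\) attention.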
- Token weight: the authors want the student encoder to focus on key visual tokens in the same way as the pretrained CLIP. They therefore compute the \([CLS]\) attention between CLIP's \([CLS]\) token \(V^{cls}\) and its other visual tokens \(V^{res}\), and use the normalized attention as the weight of each token. \(W^{tok}\) denotes the token weight, \(W^Q\) and \(W^K\) are the query and key transformation matrices of the CLIP output layer, and \(d\) is a scaling factor that stabilizes the values.
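- Teacher weight: for the coarse teacher-level regularization, the weight of the i-th teacher is the softmax, taken across teachers, of the mean cross-attention between the \([CLS]\) token \(V^{cls}\) and that teacher's tokens \(V^t_i\). These weights reflect how strongly each teacher responds to the image and therefore how much it contributes to recognition. \(W^{tea}\) denotes the teacher weight and \(m\) the number of teacher encoders. Because the student is initialized from the pretrained CLIP encoder, CLIP is also kept as a teacher with a relatively high fixed weight, to prevent its knowledge from being forgotten.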
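- Overall loss: the total loss has two parts, \(\mathcal{L}_{\text{text}}\) and \(\mathcal{L}_{\text{kd}}\), where \(\mathcal{L}_{\text{text}}\) is the conventional log-likelihood loss used in VLMs and \(\mathcal{L}_{\text{kd}}\) is the distillation loss, with \(\lambda_{\text{kd}}\) weighting the distillation term:

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{text}} + \lambda_{\text{kd}} \cdot \mathcal{L}_{\text{kd}}\]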
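The token weights, teacher weights, and total loss described in the bullets above can be sketched together. The exact attention formula is not reproduced in this post, so the code assumes the standard scaled dot-product form, i.e. scores \(q \cdot k / \sqrt{d}\) with \(q = W^Q V^{cls}\) and \(k = W^K V\), followed by a softmax; the paper's formula may differ in detail. All function names are hypothetical, teacher tokens are assumed to live in the same space as CLIP's tokens, and the fixed high weight for the CLIP teacher is not modeled.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in matrix]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def cls_scores(cls_token, tokens, w_q, w_k, d):
    """Scaled dot-product scores between the [CLS] query and each token key."""
    query = matvec(w_q, cls_token)
    return [dot(query, matvec(w_k, tok)) / math.sqrt(d) for tok in tokens]


def token_weights(cls_token, clip_tokens, w_q, w_k, d):
    """Token-level weights: softmax over the [CLS] attention scores of CLIP's tokens."""
    return softmax(cls_scores(cls_token, clip_tokens, w_q, w_k, d))


def teacher_weights(cls_token, teacher_tokens, w_q, w_k, d):
    """Teacher-level weights: softmax across teachers of each teacher's mean score."""
    means = []
    for tokens in teacher_tokens:
        scores = cls_scores(cls_token, tokens, w_q, w_k, d)
        means.append(sum(scores) / len(scores))
    return softmax(means)


def total_loss(text_loss, kd_loss, lambda_kd):
    """L_total = L_text + lambda_kd * L_kd."""
    return text_loss + lambda_kd * kd_loss


# Toy inputs: identity projections and 2-dimensional tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
cls_token = [1.0, 0.0]
clip_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_tok = token_weights(cls_token, clip_tokens, identity, identity, d=2)

teacher_tokens = [
    [[1.0, 0.0], [1.0, 0.0]],  # teacher whose tokens align with [CLS]
    [[0.0, 1.0], [0.0, 1.0]],  # teacher whose tokens are orthogonal to [CLS]
]
w_tea = teacher_weights(cls_token, teacher_tokens, identity, identity, d=2)

loss = total_loss(text_loss=2.0, kd_loss=4.0, lambda_kd=0.5)
```

In this toy setup the token most aligned with the \([CLS]\) query receives the largest token weight, and the teacher whose tokens align with it receives the larger teacher weight.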

MoVE-KD pipeline \(Fig.1^{[1]}\): encoder adapters project the teacher encoders' outputs; teacher weights and token weights are assigned based on CLIP's \([CLS]\) token; and, to mitigate knowledge conflicts, the MoLE structure is introduced into the student encoder.
