Young's Blog

Multi-Head Mixture-of-Experts

Posted on 2025-05-12, updated on 2025-05-16, in Papers. Word count: 1.8k. Reading time ≈ 7 min.

Multi-Head Mixture-of-Experts^[1]

The authors are Xun Wu et al. from MSRA. Citation [1]: Wu, Xun et al. “Multi-Head Mixture-of-Experts.” ArXiv abs/2404.15045 (2024): n. pag.

Time

2024.Apr

Key Words

low expert activation
multi-head
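One-sentence summary: in an operation similar to multi-head attention, each input token is split into multiple sub-tokens, each sub-token is sent to the experts, and all outputs are finally merged back into the original shape. Each sub-token carries semantic information from a different feature space.

Summary:

Sparse MoE (SMoE) scales model capacity without increasing computational cost, but it suffers from low expert activation: only a small subset of experts is activated for optimization, which leads to suboptimal performance and limits its effectiveness in learning a large number of experts on complex tasks. In this paper, the authors propose Multi-Head Mixture-of-Experts (MH-MoE). MH-MoE splits each input token into multiple sub-tokens, assigns these sub-tokens to multiple experts that process them in parallel, and then seamlessly reintegrates them into the original token form. These operations let MH-MoE significantly raise expert activation while collectively attending to multiple representation spaces across different experts, which deepens context understanding. Notably, MH-MoE is straightforward to implement and is decoupled from other SMoE frameworks, so it integrates easily with them.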
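The split, route, and merge steps described above can be sketched in NumPy. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `mh_moe_forward`, the projection matrices `w_head` and `w_merge`, the single-matrix router `w_router`, and the two-layer ReLU experts are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mh_moe_forward(x, w_head, w_merge, w_router, experts, num_heads, top_k=1):
    """Hypothetical MH-MoE-style layer. x has shape (n_tokens, d_model)."""
    n_tokens, d_model = x.shape
    assert d_model % num_heads == 0
    d_sub = d_model // num_heads
    # 1) Project, then split every token into num_heads sub-tokens.
    sub_tokens = (x @ w_head).reshape(n_tokens * num_heads, d_sub)
    # 2) Route each sub-token independently (assumed: softmax gate, top-k pick).
    gate = softmax(sub_tokens @ w_router)               # (n_tokens*num_heads, n_experts)
    chosen = np.argsort(-gate, axis=-1)[:, :top_k]      # (n_tokens*num_heads, top_k)
    out = np.zeros_like(sub_tokens)
    for e, (w_in, w_out) in enumerate(experts):
        rows = np.nonzero((chosen == e).any(axis=-1))[0]
        if rows.size == 0:
            continue                                    # this expert received no sub-token
        hidden = np.maximum(sub_tokens[rows] @ w_in, 0.0)
        out[rows] += gate[rows, e][:, None] * (hidden @ w_out)
    # 3) Merge the sub-tokens back into the original token shape.
    merged = out.reshape(n_tokens, d_model) @ w_merge
    return merged, chosen

# Toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
n_tokens, d_model, num_heads, n_experts, d_hidden = 4, 8, 2, 4, 16
d_sub = d_model // num_heads
x = rng.normal(size=(n_tokens, d_model))
w_head = rng.normal(size=(d_model, d_model)) * 0.1
w_merge = rng.normal(size=(d_model, d_model)) * 0.1
w_router = rng.normal(size=(d_sub, n_experts))
experts = [(rng.normal(size=(d_sub, d_hidden)), rng.normal(size=(d_hidden, d_sub)))
           for _ in range(n_experts)]
y, chosen = mh_moe_forward(x, w_head, w_merge, w_router, experts, num_heads)
print(y.shape, chosen.shape)
```

Because routing happens per sub-token rather than per token, a batch of `n_tokens` tokens produces `n_tokens * num_heads` routing decisions, which is the mechanism the summary credits for higher expert activation.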


Self-Guided Masked Autoencoder

Posted on 2025-05-12, in Papers. Word count: 2.5k. Reading time ≈ 9 min.

Self-Guided Masked Autoencoder^[1]

The authors are Jeongwoo Shin et al. from Google and Seoul National University. Citation [1]: Shin, Jeongwoo et al. “Self-Guided Masked Autoencoder.” Neural Information Processing Systems (2024).

Time

Key Words

Masked Autoencoder

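Summary

MAE is a self-supervised approach to representation learning that is widely used for downstream tasks in computer vision. Despite its success, how it learns has not been fully uncovered. In this paper, the authors carry out an in-depth analysis and find that MAE learns pattern-based patch-level clustering from the early stages of pretraining. Building on this understanding, they propose the self-guided masked autoencoder, which internally generates informed masks by exploiting its progress in patch clustering, replacing the random masking of the original MAE. The method does not rely on any external models or supplementary information. It significantly improves the learning process while fully preserving the benefit of MAE's self-supervised nature.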


LLMDet

Posted on 2025-05-07, updated on 2025-06-22, in Papers. Word count: 2.7k. Reading time ≈ 10 min.

LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models^[1]

The authors are Shenghao Fu et al. from Sun Yat-sen University, Alibaba, and other institutions. Citation [1]: Fu, Shenghao et al. “LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models.” ArXiv abs/2501.18954 (2025): n. pag.

Time

2025.Jan

Key Words

image-level and region-level captions
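In one sentence: previous open-vocabulary detectors used only a short caption for each object, which is a coarse description. This work builds a larger and more detailed caption dataset, then uses an LLM to supervise both image-level and region-level caption generation, which yields a strong open-vocabulary detector.

Summary

Recent open-vocabulary detectors achieve strong performance with large amounts of region-level annotated data. In this work, the authors show that generating image-level detailed captions for each image and co-training the open-vocabulary detector with an LLM further improves performance. To this end, they collect a dataset, GroundingCap-1M, in which each image comes with associated grounding labels and an image-level detailed caption. With this dataset, they fine-tune the open-vocabulary detector using a standard grounding loss and a caption generation loss. The LLM generates region-level short captions for each region of interest and image-level long captions for the whole image. Under the LLM's supervision, the resulting detector, LLMDet, outperforms the baseline.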


Mr.DETR

Posted on 2025-05-07, updated on 2025-05-08, in Papers. Word count: 3k. Reading time ≈ 11 min.

Mr.DETR: Instructive Multi-Route Training for Detection Transformers^[1]

The authors are Chang-Bin Zhang et al. from Visual AI Lab, HKU, and Meituan. Citation [1]: Zhang, Chang-Bin et al. “Mr. DETR: Instructive Multi-Route Training for Detection Transformers.” ArXiv abs/2412.10028 (2024): n. pag.

Time

2025.Apr

Key Words

one-to-one, one-to-many assignments
Multi-route training
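One-sentence summary: to speed up the convergence of DETR-like models, some methods adopt auxiliary training. Here the authors propose multi-route training with three routes: route-1 uses an independent FFN for o2m; route-2 is the primary route for o2o; route-3, to improve the compatibility of queries across routes, uses learnable queries as instructions and applies instructive self-attention. Beyond that, there is not much else.

Summary

Existing methods enhance detection transformers by introducing an auxiliary one-to-many assignment. In this work, the authors treat the model as a multi-task framework that performs one-to-one and one-to-many predictions simultaneously. Under these two training objectives, they study the role of each component in the Transformer decoder, including self-attention, cross-attention, and the FFN. Their results show that any independent component in the decoder can effectively learn both targets at the same time, even when other components are shared. This finding leads them to propose a multi-route training paradigm: one primary route for one-to-one prediction and two auxiliary training routes for one-to-many prediction. A new instructive self-attention dynamically and flexibly guides the object queries toward one-to-many prediction, strengthening the training mechanism. The auxiliary routes are removed at inference time, so they have no impact on the model architecture or the inference cost.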


Massive Values

Posted on 2025-05-07, in Papers. Word count: 43. Reading time ≈ 1 min.

Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding^[1]

The authors are Mingyu Jin et al. from Rutgers and other universities. Citation [1]:

Time

2025.May

Key Words

Summary

Diff_Transformer

Posted on 2025-04-30, updated on 2025-05-03, in Papers. Word count: 1.3k. Reading time ≈ 5 min.

Differential Transformer^[1]

The authors are Tianzhu Ye et al. from MSRA and Tsinghua. Citation [1]: Ye, Tianzhu et al. “Differential Transformer.” ArXiv abs/2410.05258 (2024): n. pag.

Time

2025.Apr

Key Words

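In one sentence: the difference between two softmax attention functions is used as the attention scores, which cancels attention noise.

Summary

Transformers tend to over-allocate attention to irrelevant context. In this work, the authors introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism computes the attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise and promotes the emergence of sparse attention patterns. Experiments show that Diff Transformer outperforms Transformer across various settings of scaling up model size and training tokens. More interestingly, it offers notable advantages in practical applications such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and the reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer mitigates hallucination in question answering and text summarization. For in-context learning, it not only improves accuracy but is also more robust to order permutation, which is considered a chronic robustness issue. The results indicate that Diff Transformer is an effective and promising architecture.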
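As a rough illustration of the subtraction described above, the following NumPy sketch computes single-head differential attention scores as the difference of two softmax maps. It is a simplified sketch under assumptions: the weight λ (`lam`) is a fixed scalar here for illustration, and the names `diff_attention`, `w_q`, `w_k`, and `w_v` are hypothetical. Normalization and multi-head details are left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(x, w_q, w_k, w_v, lam=0.5):
    """Single-head sketch. x: (n, d_model); w_q and w_k: (d_model, 2*d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Split queries and keys into two groups, one per softmax map.
    q1, q2 = np.split(q, 2, axis=-1)
    k1, k2 = np.split(k, 2, axis=-1)
    d_head = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d_head))
    a2 = softmax(q2 @ k2.T / np.sqrt(d_head))
    # Attention scores are the difference of the two maps; shared noise cancels.
    scores = a1 - lam * a2
    return scores @ v, scores

# Toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
n, d_model, d_head = 5, 8, 4
x = rng.normal(size=(n, d_model))
w_q = rng.normal(size=(d_model, 2 * d_head))
w_k = rng.normal(size=(d_model, 2 * d_head))
w_v = rng.normal(size=(d_model, d_head))
out, scores = diff_attention(x, w_q, w_k, w_v, lam=0.5)
print(out.shape, scores.sum(axis=-1))
```

Each row of `scores` sums to `1 - lam` rather than 1, and individual entries can be negative, which is what allows attention mass common to both maps to be subtracted away.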


SpatialVLA

Posted on 2025-04-29, updated on 2025-05-03, in Papers. Word count: 1.4k. Reading time ≈ 5 min.

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model^[1]

The authors are Delin Qu et al. from Shanghai AI Lab, TeleAI, and ShanghaiTech. Citation [1]: Qu, Delin et al. “SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model.” ArXiv abs/2501.15830 (2025): n. pag.

Time

2025.Mar

Key Words

Ego3D position Encoding
Adaptive Action Grids

Summary

The authors argue that spatial understanding is the key point in robot manipulation and propose SpatialVLA to explore effective spatial representations. Specifically, they introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and propose Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids, which facilitates learning generalizable and transferable spatial action knowledge for cross-robot control. SpatialVLA is first pre-trained on top of a vision-language model with 1.1 million real-world robot episodes to learn a generalist manipulation policy across multiple environments. After pre-training, SpatialVLA can perform many tasks in a zero-shot manner.


MAP

Posted on 2025-04-22, updated on 2025-04-29, in Papers. Word count: 2.3k. Reading time ≈ 8 min.

MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining^[1]

The authors are Yunze Liu and Li Yi from IIIS at Tsinghua University, Shanghai AI Lab, and Shanghai Qi Zhi Institute. Citation [1]: Liu, Yunze and Li Yi. “MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining.” ArXiv abs/2410.00871 (2024): n. pag.

Time

2025.Mar

Key Words

masked Autoregressive Pretraining
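One-sentence summary: it combines the local features of MAE for Transformers with the long-context modeling of AR for Mamba.

Summary

Hybrid Mamba-Transformer networks have recently attracted a lot of attention. They leverage the scalability of Transformers together with Mamba's long-context modeling and efficient computation. However, how to pretrain such hybrid networks effectively remains an open question. Existing methods, such as MAE or autoregressive (AR) pretraining, focus mainly on single-type network architectures, whereas a hybrid Mamba-Transformer structure needs a pretraining strategy that is effective for both kinds of module. Based on this, the authors propose Masked Autoregressive Pretraining (MAP), which improves the performance of both the Mamba and Transformer modules within a unified paradigm.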


MoE

Posted on 2025-04-21, in Papers. Word count: 324. Reading time ≈ 1 min.

MoE

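The decoder contains multiple FFNNs, each corresponding to one expert. A router is placed in front of the experts and is trained to choose which expert each token uses. The router is itself an FFNN that selects experts according to the specific input: it outputs probabilities, and these probabilities are used to pick the best-matching expert. The expert layer returns its output multiplied by the gate value (the selection probability). The router and the experts together make up the MoE layer. The advantage is a large parameter count with low training and inference cost, because only the selected expert runs for each token.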
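A minimal NumPy sketch of the routing described above, assuming top-1 selection. The router is reduced to a single linear layer followed by a softmax, and the experts to two-layer ReLU FFNNs; these simplifications and the names `moe_forward` and `expert_param_counts` are hypothetical choices for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, w_router, experts):
    """x: (n_tokens, d_model). Each token is sent to its single best expert."""
    gate = softmax(x @ w_router)          # router probabilities, (n_tokens, n_experts)
    best = gate.argmax(axis=-1)           # index of the chosen expert per token
    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        rows = np.nonzero(best == e)[0]
        if rows.size == 0:
            continue                      # no token picked this expert
        hidden = np.maximum(x[rows] @ w_in, 0.0)
        # Expert output multiplied by the gate value (selection probability).
        out[rows] = gate[rows, e][:, None] * (hidden @ w_out)
    return out, best

def expert_param_counts(experts):
    """Total expert parameters vs. parameters used per token (top-1, equal-size experts)."""
    sizes = [w_in.size + w_out.size for w_in, w_out in experts]
    return sum(sizes), sizes[0]

# Toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden, n_experts = 6, 8, 32, 4
x = rng.normal(size=(n_tokens, d_model))
w_router = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, d_hidden)), rng.normal(size=(d_hidden, d_model)))
           for _ in range(n_experts)]
out, best = moe_forward(x, w_router, experts)
total, active = expert_param_counts(experts)
print(out.shape, best, total, active)
```

With these toy sizes the layer stores four experts' worth of parameters, but each token only passes through one of them, which is the capacity-versus-cost trade-off the paragraph describes.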
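LoRA: approximates a full-size matrix with the product of two low-rank matrices. What is approximated is not the model's parameter matrix $W_0$ itself but its update $\Delta W$, so the updated parameter matrix becomes: \[W = W_0 + \Delta W = W_0 + BA\] where $B \in \mathbb{R}^{d_{out} \times r}$, $A \in \mathbb{R}^{r \times d_{in}}$, and $r \ll \min(d_{in}, d_{out})$. Fine-tuning only needs to store the two low-rank matrices $A$ and $B$, which greatly reduces storage. $A$ is initialized from a Gaussian and $B$ is initialized to zero. A scaling factor $\alpha/r$ is then added, where $\alpha$ is a hyperparameter:

\[h = W_0x + \frac{\alpha}{r}\Delta{W}x = W_0x + \frac{\alpha}{r}BAx\] During training $W_0$ is kept frozen. Initializing $B$ with all zeros guarantees that $\Delta W = BA = 0$ at initialization. Adjusting $\alpha$ is roughly equivalent to adjusting the learning rate.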
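The LoRA forward pass above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any library's API: the class name `LoRALinear`, its method names, and the Gaussian scale 0.01 are assumptions made for the example.

```python
import numpy as np

class LoRALinear:
    """Hypothetical LoRA-wrapped linear layer: h = W0 x + (alpha / r) * B A x."""

    def __init__(self, w0, r, alpha, rng):
        d_out, d_in = w0.shape
        self.w0 = w0                                      # frozen pretrained weight
        self.a = rng.normal(scale=0.01, size=(r, d_in))   # A: Gaussian init
        self.b = np.zeros((d_out, r))                     # B: zero init, so BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # x: (batch, d_in). Only A and B would be updated during fine-tuning.
        return x @ self.w0.T + self.scale * (x @ self.a.T) @ self.b.T

    def merged_weight(self):
        # Equivalent dense weight, W0 + (alpha / r) * B A.
        return self.w0 + self.scale * self.b @ self.a

    def num_lora_params(self):
        return self.a.size + self.b.size

# Toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 10, 2, 4.0
w0 = rng.normal(size=(d_out, d_in))
layer = LoRALinear(w0, r, alpha, rng)
x = rng.normal(size=(3, d_in))
h_init = layer.forward(x)                # equals x @ w0.T because B is all zeros
layer.b = rng.normal(size=(d_out, r))    # stand-in for a trained B
h_tuned = layer.forward(x)
print(layer.num_lora_params(), w0.size)
```

Only `r * (d_in + d_out)` values need to be stored per adapted matrix, and `merged_weight()` shows that the adapter can be folded back into a single dense matrix after fine-tuning.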

Reference links

https://zhuanlan.zhihu.com/p/22651790583

aed

Posted on 2025-04-20, updated on 2025-05-04, in Papers. Word count: 3.2k. Reading time ≈ 12 min.

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors^[1]

The authors are Nicolae-Catalin Ristea et al. from the University of Bucharest and other institutions. Citation [1]: Ristea, Nicolae-Cătălin et al. “Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors.” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023): 15984-15995.

Time

2024.Mar

Key Words

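Tokens are weighted by motion weights, self-distillation is applied, and synthetic anomaly data is added to the training data to improve video anomaly detection performance.

Summary

The authors propose an efficient abnormal event detection model based on a lightweight auto-encoder (AE) applied at the video frame level. Its novelty is threefold: (1) it introduces a way to weight tokens based on motion gradients, shifting the focus from the static background scene to the foreground objects; (2) it integrates a teacher decoder and a student decoder and uses the discrepancy between the two decoders' outputs to improve anomaly detection; (3) it generates synthetic abnormal events to augment the training videos, and has the masked AE model reconstruct both the original frames and the corresponding pixel-level anomaly maps. The resulting design is both efficient and effective.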

