Massive Values
Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding[1]
The authors are Mingyu Jin et al. from Rutgers and other institutions. Citation [1]:
Time
- 2025.May
The authors are Tianzhu Ye et al. from MSRA and Tsinghua. Citation [1]: Ye, Tianzhu et al. “Differential Transformer.” ArXiv abs/2410.05258 (2024): n. pag.
The authors are Delin Qu et al. from Shanghai AI Lab, TeleAI, and ShanghaiTech. Citation [1]: Qu, Delin et al. “SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model.” ArXiv abs/2501.15830 (2025): n. pag.
The authors are Yunze Liu and Li Yi from Tsinghua IIIS, Shanghai AI Lab, and the Shanghai Qi Zhi Institute. Citation [1]: Liu, Yunze and Li Yi. “MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining.” ArXiv abs/2410.00871 (2024): n. pag.
The decoder contains multiple FFNNs, each corresponding to one expert. A router is placed in front of the experts and is trained to decide which expert each token should use. The router is itself an FFNN: given the input it outputs probabilities, and these probabilities are used to pick the best-matching expert. The expert layer's output is then multiplied by the gating value (the selection probability). The router and the experts together make up the MoE layer. The benefit is a large total parameter count with low training and inference cost.
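A minimal sketch of this routing scheme, assuming top-1 routing and illustrative names and shapes (not taken from any specific paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sketch of a token-level MoE layer: a linear router scores the experts,
    each token is sent to its top-1 expert, and the expert output is scaled
    by the gating probability."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # the router is itself a small FFNN
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                               # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)       # per-token expert probabilities
        gate, idx = probs.max(dim=-1)                   # top-1 expert and its gating value
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                             # tokens routed to expert e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

# usage: route 8 tokens of width 16 through 4 experts
layer = MoELayer(d_model=16, d_ff=64, num_experts=4)
print(layer(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```

Only the selected expert runs for each token, which is why inference cost stays low even though the total parameter count grows with the number of experts.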
LoRA: approximate a high-rank matrix with the product of two low-rank matrices. What is approximated is not the model's weight matrix \(W_0\) itself but its increment \(\delta{W}\); the updated weight matrix becomes \[W = W_0 + \delta{W} = W_0 + BA\] where \(B \in \mathbb{R}^{d_{out} \times r}\), \(A \in \mathbb{R}^{r \times d_{in}}\), and \(r \ll \min(d_{in}, d_{out})\). During fine-tuning only the two low-rank matrices A and B need to be stored, which greatly reduces memory. A is initialized with a Gaussian and B with zeros. A scaling factor \(\alpha/r\) is added, with \(\alpha\) a hyperparameter:
\[h = W_0x + \delta{W}x = W_0x + \frac{\alpha}{r}BAx\] During training \(W_0\) is kept frozen. Initializing B with zeros guarantees that \(\delta{W} = BA = 0\) at initialization, and adjusting \(\alpha\) is effectively adjusting the learning rate. A minimal sketch follows the reference link below.
https://zhuanlan.zhihu.com/p/22651790583
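A minimal sketch of a LoRA-adapted linear layer implementing \(h = W_0x + \frac{\alpha}{r}BAx\); the class and argument names are illustrative, not from any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of LoRA: frozen base weight W0 plus a trainable low-rank update
    scaled by alpha / r."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)               # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init -> delta W = 0 at start
        self.scaling = alpha / r

    def forward(self, x):                                     # x: (batch, d_in)
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# usage: only A and B receive gradients
layer = LoRALinear(d_in=32, d_out=32, r=4)
out = layer(torch.randn(5, 32))
print([n for n, p in layer.named_parameters() if p.requires_grad])  # ['A', 'B']
```

Because B starts at zero, the layer's output at initialization equals the frozen base layer's output, so fine-tuning starts from the pretrained behavior.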
The authors are Nicolae-Cătălin Ristea et al. from the University of Bucharest and other institutions. Citation [1]: Ristea, Nicolae-Cătălin et al. “Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors.” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023): 15984-15995.
The authors are Yuanze Lin et al. from Oxford and Microsoft. Citation [1]: Lin, Yuanze et al. “Olympus: A Universal Task Router for Computer Vision Tasks.” ArXiv abs/2412.09612 (2024): n. pag.
The authors are Tim Meinhardt et al. from TUM and FAIR. Citation [1]:
The authors are Jiajun Cao et al. from Peking University and other institutions. Citation [1]: Cao, Jiajun et al. “MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders.” ArXiv abs/2501.01709 (2025): n. pag.
The authors are Kaiming He et al. from FAIR. Citation [1]: He, Kaiming et al. “Momentum Contrast for Unsupervised Visual Representation Learning.” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019): 9726-9735.