Diff_Transformer
Differential Transformer[1]
The authors are Tianzhu Ye et al. from MSRA and Tsinghua. Reference [1]: Ye, Tianzhu et al. "Differential Transformer." arXiv abs/2410.05258 (2024).
Time
- 2025.Apr
Key Words
- In one sentence: use the difference between two softmax attention functions as the attention scores, so that attention noise cancels out.
Summary
- Transformer tends to over-allocate attention to irrelevant context. In this work, the authors introduce Diff Transformer, which amplifies attention to the relevant context while canceling out noise. Specifically, the differential attention mechanism computes attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise and promotes the emergence of sparse attention patterns. Experimental results show that Diff Transformer outperforms Transformer across various settings when scaling up model size and training tokens. More interestingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer mitigates hallucination in question answering and text summarization. For in-context learning, it not only improves accuracy but is also more robust to order permutation, which has been considered a chronic robustness issue. These results position Diff Transformer as an effective and promising architecture.
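- Below is a minimal, single-head PyTorch sketch of the differential attention idea described above. It is an illustration under simplifying assumptions, not the paper's implementation: the class name `DifferentialAttention` is made up, λ is kept as a plain learnable scalar (the paper re-parameterizes it), and multi-head structure, per-head normalization, and causal masking are omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentialAttention(nn.Module):
    """Single-head sketch: attention = softmax map 1 - lambda * softmax map 2."""

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Two sets of query/key projections, one shared value projection.
        self.w_q = nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_k = nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_v = nn.Linear(d_model, d_head, bias=False)
        # Simplified: a single learnable scalar weighting the second map
        # (the paper uses a re-parameterized lambda instead).
        self.lam = nn.Parameter(torch.tensor(lambda_init))
        self.d_head = d_head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.w_q(x).chunk(2, dim=-1)   # two independent query sets
        k1, k2 = self.w_k(x).chunk(2, dim=-1)   # two independent key sets
        v = self.w_v(x)
        scale = 1.0 / math.sqrt(self.d_head)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        # Differential attention: subtracting the two maps cancels the
        # common-mode "noise" assigned to irrelevant context.
        return (a1 - self.lam * a2) @ v


if __name__ == "__main__":
    attn = DifferentialAttention(d_model=64, d_head=32)
    out = attn(torch.randn(2, 10, 64))
    print(out.shape)  # torch.Size([2, 10, 32])
```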