MTA

Multi-Token Attention[1]

The authors are Olga Golovneva et al. from FAIR. Reference [1]: Golovneva, Olga et al. “Multi-Token Attention.” (2025).

Time

  • 2025.Apr

Key Words

  • single token similarity bottleneck

Summary

  1. Soft attention is an important mechanism that lets LLMs locate relevant parts within a given context. However, each individual attention weight is determined by the similarity of a single query vector and a single key vector, and this single-token attention bottlenecks the information available for distinguishing a relevant part from the rest of the context. To address this, the authors propose a new attention method, Multi-Token Attention (MTA), which lets LLMs condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys, and heads, so that neighboring queries and keys can influence each other's attention weights for more precise attention. The method can therefore locate relevant context using richer, more fine-grained information that exceeds single-vector capacity.
  2. Standard multi-head attention compares the current query vector against the key vectors; keys similar to the query receive higher attention weights, so the corresponding value vectors dominate the output vector (the standard formulation is written out after this list). The authors argue that relying on the similarity of single token vectors imposes a fundamental limitation on the attention mechanism: in many cases the relevant part of the context cannot be identified by a single token. The goal of MTA is to use similarities of multiple vector pairs to determine where attention must focus. To this end, the authors design convolution operations over queries, keys, and heads so that attention weights can be conditioned on neighboring keys, previous queries, and other heads.

  3. Pre-softmax convolution: a convolution is applied to the attention logits, combining information from multiple query and key tokens (a candidate formulation is given after this list).

  4. Post-softmax convolution: \[A = \text{Mask}_{0}\left( \text{Conv2d}_{\theta}\left( \text{Softmax}\left( \text{Mask}_{-\infty}(\hat{A}) \right) \right) \right)\]

  5. Head mixing convolution: attention weights are mixed across heads within each group of heads (see the code sketch after this list).
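
For reference (item 2), standard attention for a single head with queries \(Q\), keys \(K\), values \(V\) and head dimension \(d\) computes

\[\hat{A} = \frac{QK^{\top}}{\sqrt{d}}, \qquad A = \text{Softmax}\left( \text{Mask}_{-\infty}(\hat{A}) \right), \qquad \text{Output} = AV,\]

which defines the \(\hat{A}\) that the formulas above and below operate on; every weight in \(A\) depends on a single query-key pair.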
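
A plausible way to write the pre-softmax variant of item 3, by analogy with the post-softmax formula in item 4 (masking to zero before the convolution keeps future positions from leaking into earlier rows; the paper's exact placement of the masks may differ):

\[A = \text{Softmax}\left( \text{Mask}_{-\infty}\left( \text{Conv2d}_{\theta}\left( \text{Mask}_{0}(\hat{A}) \right) \right) \right)\]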
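
The sketch below (a minimal PyTorch toy, not the authors' implementation) puts items 3-5 together: a depthwise \((c_q \times c_k)\) convolution over each head's query-key logit map plays the role of the key-query convolution, and a grouped \(1 \times 1\) convolution over the head dimension mixes attention weights within groups of heads after softmax. Kernel sizes, the grouping factor, and the names MTASketch, c_q, c_k, head_group are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTASketch(nn.Module):
    """Toy sketch of MTA-style key-query and head-mixing convolutions.

    Not the authors' code: kernel sizes, grouping, and parameter names
    (c_q, c_k, head_group) are illustrative assumptions.
    """

    def __init__(self, dim, n_heads, c_q=6, c_k=11, head_group=2):
        super().__init__()
        assert dim % n_heads == 0 and n_heads % head_group == 0
        assert c_k % 2 == 1  # centered key window needs an odd kernel width
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.c_q, self.c_k = c_q, c_k
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # depthwise (c_q x c_k) kernel per head over the (query, key) logit map
        self.kq_conv = nn.Conv2d(n_heads, n_heads, kernel_size=(c_q, c_k),
                                 groups=n_heads, bias=False)
        # 1x1 grouped conv = mixing attention weights within groups of heads
        self.head_conv = nn.Conv2d(n_heads, n_heads, kernel_size=1,
                                   groups=n_heads // head_group, bias=False)

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)  # (B,H,T,d)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5        # (B,H,T,T)
        causal = torch.ones(T, T, device=x.device).tril().bool()

        # Pre-softmax key-query convolution: zero future entries first (Mask_0)
        # so the kernel cannot pull in information from future keys, then pad so
        # the kernel sees only previous queries and a centered window of keys.
        logits = logits.masked_fill(~causal, 0.0)
        logits = F.pad(logits, (self.c_k // 2, self.c_k // 2, self.c_q - 1, 0))
        logits = self.kq_conv(logits)                                  # (B,H,T,T)

        attn = logits.masked_fill(~causal, float("-inf")).softmax(dim=-1)

        # Post-softmax head mixing within each group of `head_group` heads
        # (future positions stay zero: they are zero in every head and the
        # 1x1 convolution has no bias).
        attn = self.head_conv(attn)

        out = attn @ v                                                 # (B,H,T,d)
        return self.out(out.transpose(1, 2).reshape(B, T, D))


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)
    print(MTASketch(dim=64, n_heads=4)(x).shape)  # torch.Size([2, 16, 64])
```

Placing self.head_conv before the softmax instead would give a logit-level head-mixing variant; the figure caption below describes the head convolution as applied after softmax normalization, which is what this sketch follows.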

Multi-Token Attention \(Fig.1^{[1]}\): Multi-Token Attention on the right; compared with standard attention, each head applies a key-query convolution, and after softmax normalization a head convolution is applied across groups of heads.

Convolution \(Fig.2^{[1]}\): MTA applies key-query and head convolutions over the attention values.