MTA

Multi-Token Attention[1]

The authors are Olga Golovneva et al. from FAIR. Reference [1]: Golovneva, Olga et al. “Multi-Token Attention.” (2025).

Time

  • 2025.Apr

Key Words

  • single token similarity bottleneck

Summary

  1. Soft attention is an important mechanism that lets LLMs locate relevant parts within a given context. However, each individual attention weight is determined by the similarity of a single query vector and a single key vector, and this single-token attention bottlenecks the information available for distinguishing a relevant part from the rest of the context. To address this, the authors propose a new attention method, Multi-Token Attention (MTA), which lets LLMs condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys, and heads, so that neighboring queries and keys can influence each other's attention weights for more precise attention. The method can therefore locate relevant context using richer, more fine-grained information that exceeds single-vector capacity.
  2. Standard multi-head attention compares the current query vector against the key vectors; keys similar to the query receive higher attention weights, so the corresponding value vectors dominate the output vector (the standard formulation is written out after this list). The authors argue that relying on the similarity of single token vectors imposes a fundamental limitation on the attention mechanism: in many cases the relevant part of the context cannot be identified by a single token. The goal of MTA is to use similarities of multiple vector pairs to determine where attention must focus. To this end, the authors design convolution operations over queries, keys, and heads so that attention weights can be conditioned on neighboring keys, previous queries, and other heads.

  3. Pre-softmax convolution: a convolution is applied to the attention logits, combining information from multiple query and key tokens (a candidate formulation is given after this list).

  4. Post-softmax convolution: \[A = \text{Mask}_{0}\left( \text{Conv2d}_{\theta}\left( \text{Softmax}\left( \text{Mask}_{-\infty}(\hat{A}) \right) \right) \right)\]

  5. Head mixing convolution: attention weights are mixed across heads within each group of heads (see the code sketch after this list).
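
For reference (item 2), standard attention for a single head with queries \(Q\), keys \(K\), values \(V\) and head dimension \(d\) computes

\[\hat{A} = \frac{QK^{\top}}{\sqrt{d}}, \qquad A = \text{Softmax}\left( \text{Mask}_{-\infty}(\hat{A}) \right), \qquad \text{Output} = AV,\]

which defines the \(\hat{A}\) that the formulas above and below operate on; every weight in \(A\) depends on a single query-key pair.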
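
A plausible way to write the pre-softmax variant of item 3, by analogy with the post-softmax formula in item 4 (masking to zero before the convolution keeps future positions from leaking into earlier rows; the paper's exact placement of the masks may differ):

\[A = \text{Softmax}\left( \text{Mask}_{-\infty}\left( \text{Conv2d}_{\theta}\left( \text{Mask}_{0}(\hat{A}) \right) \right) \right)\]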
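
The sketch below (a minimal PyTorch toy, not the authors' implementation) puts items 3-5 together: a depthwise \((c_q \times c_k)\) convolution over each head's query-key logit map plays the role of the key-query convolution, and a grouped \(1 \times 1\) convolution over the head dimension mixes attention weights within groups of heads after softmax. Kernel sizes, the grouping factor, and the names MTASketch, c_q, c_k, head_group are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTASketch(nn.Module):
    """Toy sketch of MTA-style key-query and head-mixing convolutions.

    Not the authors' code: kernel sizes, grouping, and parameter names
    (c_q, c_k, head_group) are illustrative assumptions.
    """

    def __init__(self, dim, n_heads, c_q=6, c_k=11, head_group=2):
        super().__init__()
        assert dim % n_heads == 0 and n_heads % head_group == 0
        assert c_k % 2 == 1  # centered key window needs an odd kernel width
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.c_q, self.c_k = c_q, c_k
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # depthwise (c_q x c_k) kernel per head over the (query, key) logit map
        self.kq_conv = nn.Conv2d(n_heads, n_heads, kernel_size=(c_q, c_k),
                                 groups=n_heads, bias=False)
        # 1x1 grouped conv = mixing attention weights within groups of heads
        self.head_conv = nn.Conv2d(n_heads, n_heads, kernel_size=1,
                                   groups=n_heads // head_group, bias=False)

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)  # (B,H,T,d)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5        # (B,H,T,T)
        causal = torch.ones(T, T, device=x.device).tril().bool()

        # Pre-softmax key-query convolution: zero future entries first (Mask_0)
        # so the kernel cannot pull in information from future keys, then pad so
        # the kernel sees only previous queries and a centered window of keys.
        logits = logits.masked_fill(~causal, 0.0)
        logits = F.pad(logits, (self.c_k // 2, self.c_k // 2, self.c_q - 1, 0))
        logits = self.kq_conv(logits)                                  # (B,H,T,T)

        attn = logits.masked_fill(~causal, float("-inf")).softmax(dim=-1)

        # Post-softmax head mixing within each group of `head_group` heads
        # (future positions stay zero: they are zero in every head and the
        # 1x1 convolution has no bias).
        attn = self.head_conv(attn)

        out = attn @ v                                                 # (B,H,T,d)
        return self.out(out.transpose(1, 2).reshape(B, T, D))


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)
    print(MTASketch(dim=64, n_heads=4)(x).shape)  # torch.Size([2, 16, 64])
```

Placing self.head_conv before the softmax instead would give a logit-level head-mixing variant; the figure caption below describes the head convolution as applied after softmax normalization, which is what this sketch follows.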

Multi-Token Attention \(Fig.1^{[1]}\): Multi-Token Attention on the right; compared with standard attention, each head applies a key-query convolution, and after softmax normalization a head convolution is applied across groups of heads.

Convolution \(Fig.2^{[1]}\): MTA applies key-query and head convolutions over the attention values.