STAR

Posted on 2024-10-24 · Updated on 2024-11-07 · Category: Papers

End-to-End Spatio-Temporal Action Localisation with Video Transformers^[1]

The authors are Alexey Gritsenko, Xuehan Xiong and colleagues from Google. Citation [1]: Gritsenko, Alexey A. et al. “End-to-End Spatio-Temporal Action Localisation with Video Transformers.” 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023): 18373-18383.

Time

2023.Apr

Key Words

without resorting to external proposals or memory banks
directly predicts tubelets even without full tubelet annotations

Summary

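The best-performing spatio-temporal action detection models rely on extra person proposals and complex external memory banks. The authors propose an end-to-end, pure-transformer model that takes a video directly as input and outputs tubelets (a sequence of bounding boxes and action classes at each frame). The model is flexible: it can be trained with either sparse bounding-box supervision on individual frames or full tubelet annotations, and in both cases it predicts coherent tubelets as output. In addition, it needs neither pre-processing in the form of proposals nor post-processing such as NMS.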

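The STAD/STAL task is usually approached in one of two ways:
- Predict bounding boxes and actions on a single keyframe, given the neighbouring frames as spatio-temporal context.
- Predict a sequence of bounding boxes and actions (tubes) for each actor at each frame in the video.

The strongest models are keyframe-based and use a two-stage pipeline inspired by Fast R-CNN: a separate detector first produces proposals, and the features inside each proposal are then pooled and classified into the actions of interest. These models also rely heavily on memory banks to bring in context from other frames. Proposal-free algorithms, which need no external person detector, exist for both keyframe-level and tubelet-level prediction, but their performance lags behind proposal-based methods. Here, the authors show that an end-to-end trainable model outperforms the two-stage approach.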
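STAR is a pure-transformer model based on the DETR detector. Unlike most prior work, it is end-to-end and needs no pre-processing or post-processing. The first stage is a vision encoder, followed by a decoder that turns learnable latent queries (each representing an actor in the video) into output tubelets (a tubelet is a sequence of bounding boxes and action classes at each time step of the input video clip). The model is versatile: it can be trained with fully labelled tube annotations or with sparse keyframe annotations. In the latter case the network still predicts tubelets and learns to associate detections of the same actor across frames. This behaviour is facilitated by the factorised queries, the decoder architecture and the tubelet matching in the loss, all of which contain temporal inductive biases.
Two-stage methods first obtain detections to compute proposals and then classify the actions; this approach performs well on datasets such as AVA. Earlier proposal-free methods built on object-detection architectures such as SSD, CenterNet and YOLO have been outperformed by proposal-based methods. In addition, some of these methods use separate networks, one learning the video representation and one producing proposals for the keyframe, and they cannot predict tubelets. The closest prior work is TubeR. The authors' model is likewise based on DETR, but unlike TubeR it is a pure transformer, and the authors show that it can also predict tubelets.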
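Encoder: The model consists of a vision encoder followed by a decoder that turns learned query tokens into tubelets. Temporal inductive biases are built into the decoder to improve accuracy and tubelet prediction. The model is inspired by the DETR architecture and is likewise trained with a set-based loss and Hungarian matching. The backbone takes a video as input and produces a feature representation of it. A ViViT factorised encoder is used, and the spatio-temporal dimensions of its output depend on the patch size used when tokenising the input. To preserve spatio-temporal information, the spatial and temporal aggregation steps of the original transformer backbone are removed. If the temporal patch size is greater than 1, the final feature map is bilinearly upsampled along the time axis to keep the original temporal resolution.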
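The temporal upsampling step can be illustrated with a minimal sketch. Along a single axis, bilinear upsampling reduces to linear interpolation between neighbouring frames. The function name `upsample_time`, the list-of-vectors layout and the half-pixel-centre alignment are all assumptions made for illustration; the notes above do not say which alignment convention the authors use.

```python
def upsample_time(features, factor):
    """Linearly interpolate per-frame feature vectors along the time axis.

    features: list of T_in feature vectors (lists of floats), one per temporal token.
    factor:   temporal patch size, i.e. how many input frames one token covers.
    Returns T_in * factor feature vectors.
    """
    t_in = len(features)
    out = []
    for i in range(t_in * factor):
        # Half-pixel-centre convention (an assumption), clamped at the clip borders.
        pos = (i + 0.5) / factor - 0.5
        pos = min(max(pos, 0.0), float(t_in - 1))
        lo = int(pos)
        hi = min(lo + 1, t_in - 1)
        w = pos - lo  # interpolation weight of the later frame
        out.append([(1 - w) * a + w * b for a, b in zip(features[lo], features[hi])])
    return out


# Two temporal tokens with a 1-dimensional feature, temporal patch size 2.
upsampled = upsample_time([[0.0], [4.0]], factor=2)
```

With a temporal patch size of 2, a clip tokenised into two temporal positions is restored to four, so the decoder can emit one box per original frame.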
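Tubelet Decoder: The decoder processes the visual features \(x \in \mathbb{R}^{T\times h\times w\times c}\) together with learned queries \(q \in \mathbb{R}^{T\times S\times d}\) to output tubelets \(y=(b,a)\), which are a sequence of bounding boxes \(b \in \mathbb{R}^{T\times S\times 4}\) and the corresponding actions \(a \in \mathbb{R}^{T\times S\times C}\). Here \(S\) is the maximum number of bounding boxes per frame and \(C\) is the number of output classes. The learned queries are decoded into output detections by a transformer decoder. Overall, the decoder consists of \(L\) layers, each of which performs self-attention over the queries and cross-attention between the queries and the encoder outputs. The authors modify the queries, the self-attention and the cross-attention for the spatio-temporal localisation scenario, adding extra temporal inductive biases to improve accuracy.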
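Queries: In DETR, the queries \(q\) are decoded into bounding-box predictions using the encoded visual features, and play a role similar to the anchors of other detection architectures. The most direct way to define the queries is to randomly initialise \(q \in \mathbb{R}^{T\times S\times d}\). However, the authors find it more effective to factorise the queries into separate learned spatial (\(q^s \in \mathbb{R}^{S\times d}\)) and temporal (\(q^t \in \mathbb{R}^{T\times d}\)) parameters. To obtain the final tubelet queries, the spatial queries are simply repeated across all frames and added to the temporal embedding of the corresponding frame. As the figure shows, and more precisely, \(\mathbf{q}_{ij} = \mathbf{q}_i^t+\mathbf{q}_j^s\), where \(i\) and \(j\) are the temporal and spatial indices respectively. With this factorised representation, the same spatial embedding is used in every frame. Intuitively, this encourages the \(j^{th}\) spatial query embedding, \(\mathbf{q}_j^s\), to bind to the same location across different frames of the video. Because objects typically move only slightly from frame to frame, this may help link bounding boxes together into a tubelet.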
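The query factorisation \(\mathbf{q}_{ij} = \mathbf{q}_i^t+\mathbf{q}_j^s\) can be written out as a small dependency-free sketch. The sizes `T`, `S`, `d`, the Gaussian initialisation and the helper name `make_tubelet_queries` are illustrative assumptions; in the real model both embedding tables are learned parameters, and an array library would replace the nested lists.

```python
import random


def make_tubelet_queries(q_temporal, q_spatial):
    """Build q[i][j] = q_temporal[i] + q_spatial[j] (element-wise), shape T x S x d."""
    return [
        [[qt + qs for qt, qs in zip(q_t, q_s)] for q_s in q_spatial]
        for q_t in q_temporal
    ]


T, S, d = 4, 3, 8  # hypothetical clip length, queries per frame, embedding width
rng = random.Random(0)
q_temporal = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]  # stand-in for learned q^t
q_spatial = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(S)]   # stand-in for learned q^s

q = make_tubelet_queries(q_temporal, q_spatial)

# Parameter counts: the factorised form stores (T + S) * d values instead of T * S * d.
factorised_params = (T + S) * d
full_params = T * S * d
```

Because every frame reuses the same `q_spatial[j]`, the queries at spatial index `j` differ across frames only by the temporal embedding, which is the property the notes describe as encouraging a query to stay bound to one location.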
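Decoder layer: The decoder layer of the original transformer consists of self-attention over the queries \(q\), followed by cross-attention between the queries and the encoder outputs \(x\), and then an MLP layer. In this model, the self-attention and cross-attention are factorised across space and time, which introduces a temporal locality inductive bias and makes the model more efficient. Concretely, to apply multi-headed self-attention (MHSA), the queries, keys and values are computed first.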
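A rough way to see the efficiency gain is to count query-key pairs. The sketch below compares joint self-attention over all \(T \cdot S\) queries with a factorised scheme made of two operations: one among the \(S\) queries of each frame and one among the \(T\) queries that share a spatial index. This decomposition is an assumed reading of the description above, and the function names and example sizes are hypothetical.

```python
def joint_sa_pairs(T, S):
    """Query-key pairs when every one of the T*S queries attends to all others."""
    n = T * S
    return n * n


def factorised_sa_pairs(T, S):
    """Query-key pairs for two restricted self-attention operations."""
    within_frame = T * S * S   # each frame: S queries attend to the S queries of that frame
    across_time = S * T * T    # each spatial index: T queries attend along the time axis
    return within_frame + across_time


# Hypothetical sizes: 32 frames, 64 queries per frame.
joint = joint_sa_pairs(32, 64)
factorised = factorised_sa_pairs(32, 64)
```

The ratio between the two counts is \(TS / (T + S)\), so the saving grows with both the clip length and the number of queries per frame.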
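Localisation and classification heads: The network's final predictions \(y=(b,a)\) are obtained by applying small feed-forward networks to the decoder outputs. The sequence of bounding boxes comes from a 3-layer MLP, and a single-layer linear projection produces the class logits. Because a fixed number \(S\) of bounding boxes is predicted for every frame, and \(S\) is larger than the maximum number of ground-truth instances in any frame, an additional class label representing background is also introduced.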
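The model directly predicts bounding boxes and action classes at each frame of the input video. Many datasets, such as AVA, are sparsely annotated, with labels only at keyframes. To make use of whatever annotations are available, the training loss is computed only at the annotated frames of the video. Set-based detection models such as DETR can emit their predictions in any order, which is why the predictions are matched to the ground truth before the training loss is computed. There are two ways to do the matching:
- Perform bipartite matching independently at each frame to align the model's predictions with the ground truth before computing the loss.
- Perform tubelet matching, where all queries with the same spatial index \(q^s\) must match the same ground-truth instance across all frames of the input video. Intuitively, tubelet matching provides stronger supervision when full tubelet annotations are available. Note that whichever matching is used, the loss computation and the overall model architecture stay the same.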
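The difference between the two matching schemes can be sketched on a toy clip. The code below uses a brute-force search over assignments and an L1 box cost purely for illustration; a real implementation would use a Hungarian solver and a cost that also involves class scores. It also assumes that ground-truth boxes are listed in a consistent actor order at every annotated frame. All names here are hypothetical.

```python
from itertools import permutations


def l1(box_a, box_b):
    """L1 distance between two (x1, y1, x2, y2) boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def best_assignment(cost):
    """Brute-force min-cost matching; cost[s][g] pairs query s with ground truth g.

    Returns a tuple m where m[g] is the query index matched to ground truth g.
    """
    n_queries, n_gt = len(cost), len(cost[0])
    return min(
        permutations(range(n_queries), n_gt),
        key=lambda m: sum(cost[m[g]][g] for g in range(n_gt)),
    )


def per_frame_matching(pred, gt):
    """Match independently at every annotated frame t (the keys of gt)."""
    return {
        t: best_assignment([[l1(p, g) for g in boxes] for p in pred[t]])
        for t, boxes in gt.items()
    }


def tubelet_matching(pred, gt):
    """One assignment shared by all annotated frames: costs are summed over time."""
    frames = sorted(gt)
    n_queries, n_gt = len(pred[frames[0]]), len(gt[frames[0]])
    cost = [
        [sum(l1(pred[t][s], gt[t][g]) for t in frames) for g in range(n_gt)]
        for s in range(n_queries)
    ]
    return best_assignment(cost)


# Toy clip: 2 frames, 2 queries, 2 stationary actors; only frames listed in gt are annotated.
ACTOR_A, ACTOR_B = (0.1, 0.1, 0.2, 0.2), (0.8, 0.8, 0.9, 0.9)
gt = {0: [ACTOR_A, ACTOR_B], 1: [ACTOR_A, ACTOR_B]}
pred = [
    [(0.1, 0.1, 0.2, 0.2), (0.8, 0.8, 0.9, 0.9)],  # frame 0: query 0 on A, query 1 on B
    [(0.7, 0.7, 0.8, 0.8), (0.2, 0.2, 0.3, 0.3)],  # frame 1: the two queries have swapped
]

frame_matches = per_frame_matching(pred, gt)
tubelet_match = tubelet_matching(pred, gt)
```

In this toy clip, per-frame matching lets query 0 be supervised by actor A at frame 0 and by actor B at frame 1, whereas tubelet matching forces query 0 to stay with actor A throughout, which is the stronger temporal supervision the second bullet describes.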
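The approach is based on DETR and needs no external proposals or NMS post-processing. Applying DETR-style ideas to action localisation has already been done by TubeR and WOO, but there are differences. WOO does not detect tubelets, only the actions at the centre keyframe. In this work, queries are factorised across space and time to provide inductive biases. In addition, action classes are predicted separately for each time step of the tubelet, which means that each query binds to an actor in the video. TubeR, by contrast, parameterises its queries so that they are associated with separate actions (features are average-pooled over the tubelet and then linearly classified into a single action class). This choice also means that TubeR needs an extra action switch head to predict when a tubelet starts and ends, which the authors do not need because different time steps of a tubelet can carry different action classes. Furthermore, unlike TubeR, two different kinds of matching are considered in the loss computation, with tubelet matching designed to predict more temporally consistent tubelets. Compared with TubeR, the authors show experimentally that their decoder design lets the model predict tubelets accurately even with weak keyframe supervision. Finally, TubeR requires additional complexity (a short-term context module and an additional memory bank).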

Overview \(Fig.1^{[1]}\) Our model processes a fixed-length video clip, and for each frame, outputs tubelets (i.e. linked bounding boxes with associated action class probabilities). It consists of a transformer-based vision encoder which outputs a video representation, \(x \in \mathbb{R}^{T\times h\times w\times d}\). The video representation, along with learned queries, \(q\) (which are factorised into spatial \(q^s\) and temporal \(q^t\) components), are decoded into tubelets by a decoder of \(L\) layers followed by shallow box and class prediction heads.

\(Fig.2^{[1]}\) Our decoder layer consists of factorised self-attention(SA)(left) and cross-attention(CA)(right) operations designed to provide a spatio-temporal inductive bias and reduce computation. Both operations restrict attention to the same spatial and temporal slices as the query token, as illustrated by the receptive field(blue) for a given query token(magenta). Factorised SA consists of two operations, whilst in Factorised CA, there is one operation.
