BMViT

Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization[1]

The authors are Ioanna Ntinou, Enrique Sanchez, and Georgios Tzimiropoulos, from Queen Mary University of London, the Samsung AI Center in Cambridge, and other institutions. Reference [1]: Ntinou, Ioanna, Enrique Sanchez, and Georgios Tzimiropoulos. "Multiscale vision transformers meet bipartite matching for efficient single-stage action localization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

Time

  • 2024.May

Key Words

  • bipartite matching loss
  • Video Transformer trained with a bipartite matching loss, without learnable queries or a decoder

Summary

  1. Action localization is a challenging problem that combines detection and recognition, two tasks that are usually handled separately. SOTA methods rely on off-the-shelf bounding-box detections and then use a transformer model that focuses on the classification task. Such two-stage approaches are ill-suited to real-time deployment. Single-stage methods instead handle both tasks by sharing most of the workload, trading accuracy for speed, and DETR-like architectures are challenging to train. This paper observes that a straightforward bipartite matching loss can be applied to the output tokens of a vision transformer, leading to a backbone + MLP architecture that handles both tasks without an additional encoder-decoder head or learnable queries. A single MViTv2-S architecture trained with bipartite matching to perform both tasks outperforms an MViTv2-S trained with RoI align on pre-computed bounding boxes. With the designed token pooling and the proposed training pipeline, the resulting Bipartite-Matching Vision Transformer (BMViT) achieves strong results.
  2. The STAD task is similar to object detection, but with the particularity that the detected targets are usually people, and many different action classes can occur at the same time. This poses additional challenges: actions require temporal reasoning, and a person detector can produce false positives when a person is not performing an action of interest. Current SOTA methods reach high mAP by handing the detection task to a pre-trained Faster R-CNN and then focusing only on the network architecture and large-scale training data. The model proposed in this paper can perform both tasks simultaneously. Owing to the similarity with object detection, recent single-stage models are built on strong backbones that provide DETR with powerful spatio-temporal features, and DETR-based methods have demonstrated their effectiveness for end-to-end action localization; their designs combine a video backbone with an encoder-decoder transformer. Is there still room for improvement in the network design? The authors draw inspiration from the recent Open-World Object Detection using ViT (OWL-ViT). The contributions are as follows:

    • a bipartite matching loss between the spatio-temporal output embeddings of a single transformer backbone and the GT instances in a video clip.

    In this setting, the video embeddings are independent tokens that can be matched to the ground truth in the same way DETR matches its predictions. This means that learnable tokens are not needed, a transformer decoder is not needed, and the video backbone and encoder can be merged into a single strong video transformer. Such a simple approach, with careful token selection, makes it possible to train an MViTv2-S with a simple MLP head to directly predict bounding boxes and action classes. Without additional elements or data, the single-stage MViTv2-S outperforms a two-stage MViTv2-S trained using RoI align and pre-computed bounding boxes.

  3. Two-stage:

    • Most existing action detection works rely on an additional person detector for actor localization, typically a Faster R-CNN-R101-FPN trained on COCO and fine-tuned on AVA. By introducing such off-the-shelf detections, the STAD task is reduced to an action classification problem. In earlier methods such as SlowFast, MViT, VideoMAE and Hiera, the RoI features are used directly for action classification. However, such features restrict the information to the inside of the bounding box and ignore contextual information. To address this, AIA and ACAR-Net use an additional module to capture the interactions between the actor and the context/other actors. In addition, to model temporal information, MeMViT introduces a memory mechanism on top of MViT and achieves high accuracy. These methods, however, are not efficient for deployment.

    • Single-stage: Inspired by earlier work, several methods attempt to handle detection and classification simultaneously in a unified framework. Some borrow ideas from object detection and bring them into action detection; others simplify training through joint actor-proposal and action-classification networks, or adapt ideas from the temporal action localization (TAL) task. SE-STAD builds on Faster R-CNN and extends it to action classification. Video Action Transformer is a transformer-style action detector that aggregates the spatio-temporal context around target actors. More recent works leverage DETR, using learnable queries to model actions and bounding boxes. TubeR proposes a DETR-based architecture with a set of queries, coined Tubelet Queries, that simultaneously encode the temporal dynamics of a specific actor's bounding box as well as the corresponding actions, and models these Tubelet Queries with a single DETR head. Similarly, STMixer adaptively samples discriminative features from a multi-scale spatio-temporal feature space and encodes them with an adaptive strategy under the guidance of queries. EVAD combines two video action detection designs: token dropout focused on keyframe-centric spatio-temporal preservation, and scene-context refinement using an RoI align operation and a decoder. Compared with earlier work, these methods introduce interaction features of the context or other actors through a decoder or a heavy module.

  4. The goal of action localization is to detect and classify the actions taking place in the middle frame of a video clip. Since not every person in the clip is performing an action of interest, a distinction is made between a person and an actor: an actor at time \(t\) is defined by a bounding box \(b = [x_c, y_c, h, w]\).

  5. During training, DETR uses the Hungarian algorithm to map the \(N\) ground-truth objects in the image to the closest of the \(L\) predictions (with \(N < L\)), so that the assignment minimizes a combined bounding-box and class cost; the remaining \(L - N\) predictions are assigned to the empty class. The bounding-box and class errors are back-propagated for the outputs matched to a GT object, and only the class error for the outputs assigned to the empty class. DETR has appealing properties for action localization, since the learnable object queries can convey both object localization and action classification.
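To make the matching step concrete, below is a minimal sketch of DETR-style bipartite matching that uses `scipy.optimize.linear_sum_assignment` as the Hungarian solver. The cost terms and the weights `w_class` and `w_box` are simplified assumptions for illustration, not the paper's exact cost formulation.

```python
# Minimal sketch of DETR-style bipartite matching (simplified costs, assumed
# weights). Predictions: L boxes + class logits; ground truth: N < L instances.
# Predictions left unmatched are assigned to the empty class during training.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, pred_logits, gt_boxes, gt_labels,
                    w_class=1.0, w_box=5.0):
    """pred_boxes: (L, 4), pred_logits: (L, C), gt_boxes: (N, 4), gt_labels: (N,) int64."""
    prob = pred_logits.softmax(-1)                     # (L, C)
    class_cost = -prob[:, gt_labels]                   # (L, N): -p(gt class)
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (L, N): L1 box distance
    cost = (w_class * class_cost + w_box * box_cost).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)     # N matched (prediction, GT) pairs
    return pred_idx, gt_idx  # the other L - N predictions fall to the empty class
```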

  6. The authors follow OWL-ViT, which takes the CLIP encoder and removes its final pooling layer. Each output embedding is forwarded to a small head consisting of a box MLP and a class linear projection. Thus, the \(\hat{L} = h \times w\) output tokens from the vision encoder act as independent output pairs \(\langle \hat{b}, p(\hat{a}) \rangle\), which are matched to the GT with DETR's bipartite matching loss. Going beyond OWL-ViT, the authors note that MViT is a natural pool of spatio-temporal output embeddings that can be one-to-one matched to triplets. A simple MViTv2-S architecture trained with bipartite matching achieves higher accuracy than the same MViTv2-S used in a two-stage pipeline. The method is called the Bipartite-Matching Vision Transformer (BMViT). Since the heads are simple MLPs, no extra complexity is added to the backbone.
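A minimal sketch of the head structure described above, assuming PyTorch; the module names, hidden width and token dimension are illustrative assumptions rather than the paper's exact configuration. Each output token is processed independently to produce one \(\langle \hat{b}, p(\alpha), p(\hat{a}) \rangle\) triplet.

```python
import torch
import torch.nn as nn

class TripletHeads(nn.Module):
    """Illustrative OWL-ViT-style heads: every output token independently
    predicts a box, an actor/no-actor score and action class logits."""
    def __init__(self, dim=768, num_actions=80, hidden=256):
        super().__init__()
        self.box_mlp = nn.Sequential(                    # small MLP -> (cx, cy, w, h)
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 4))
        self.actor_proj = nn.Linear(dim, 1)              # actor likelihood p(alpha)
        self.action_proj = nn.Linear(dim, num_actions)   # action logits p(a)

    def forward(self, box_tokens, action_tokens):
        # box_tokens / action_tokens: (B, L_hat, dim), in one-to-one correspondence
        boxes = self.box_mlp(box_tokens).sigmoid()        # normalized box coordinates
        actor_logit = self.actor_proj(box_tokens)         # (B, L_hat, 1)
        action_logits = self.action_proj(action_tokens)   # (B, L_hat, num_actions)
        return boxes, actor_logit, action_logits
```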

  7. For a \(16 \times 256 \times 256\) video input, the output is a grid of \(8 \times 8 \times 8\) tokens, more than the number of queries used in DETR-based architectures; these output tokens do not increase the network complexity, because they are processed independently by the 3 MLPs. Following OWL-ViT, a bias is added to the predicted bounding boxes so that, by default, each box lies at the centre of the image patch associated with its position in the 2D grid. Although "there is no strict correspondence between image patches and token representations", biasing the box predictions speeds up training and improves the final performance. In the 3D spatio-temporal setting, the same 2D bias is added to all tokens that share a position on the 2D grid along the temporal axis.
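A minimal sketch of how such a per-token box bias could be built, assuming a normalized \((c_x, c_y, w, h)\) box parameterization; the exact coordinate space in which the bias is added is not reproduced here, and the function names are hypothetical.

```python
import torch

def box_bias_2d(h, w):
    """Per-token bias so that each predicted box defaults to the centre of its
    patch on the (h, w) grid: normalized (cx, cy), zero bias on width/height."""
    ys = (torch.arange(h, dtype=torch.float32) + 0.5) / h
    xs = (torch.arange(w, dtype=torch.float32) + 0.5) / w
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    centers = torch.stack([cx, cy], dim=-1).reshape(h * w, 2)   # (h*w, 2)
    return torch.cat([centers, torch.zeros(h * w, 2)], dim=-1)  # (h*w, 4)

def box_bias_3d(t, h, w):
    """Repeat the same 2D bias for every temporal index, as described above."""
    return box_bias_2d(h, w).repeat(t, 1)  # (t*h*w, 4)
```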

  8. Since each head processes the output tokens independently to produce \(\hat{b}\), \(p(\alpha)\) and \(p(\hat{a})\), one can choose which tokens are better suited to the detection and recognition sub-tasks. The only technical constraint is that the outputs of each head must be in one-to-one correspondence with the outputs of the other heads, so as to form triplets that will be matched to the ground-truth instances. This is an important consideration, because actor detection and action classification are opposed by definition: detecting actors only requires information from the middle frame, whereas action recognition benefits from temporal support. Since the method is not restricted to using the same output tokens for each task, the \(w \times h\) tokens at \(t = \lfloor T/2 \rfloor\) are used to produce the \(\hat{L} = w \times h\) bounding boxes \(\hat{b}\) and actor/no-actor probabilities \(p(\alpha)\), while temporal pooling is applied to produce an equivalent set of \(\hat{L}\) tokens used to compute the action probabilities \(p(\hat{a})\).
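A minimal sketch of this token selection under an assumed `(B, T, H, W, D)` layout of the backbone's output tokens; the helper name and shapes are illustrative.

```python
import torch

def select_tokens(tokens):
    """tokens: (B, T, H, W, D) spatio-temporal output tokens of the backbone.
    Returns L_hat = H*W detection tokens (middle frame) and L_hat action tokens
    (temporal average), kept in one-to-one correspondence."""
    B, T, H, W, D = tokens.shape
    det_tokens = tokens[:, T // 2]   # (B, H, W, D): middle-frame tokens for b and p(alpha)
    act_tokens = tokens.mean(dim=1)  # (B, H, W, D): temporally pooled tokens for p(a)
    return det_tokens.reshape(B, H * W, D), act_tokens.reshape(B, H * W, D)
```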

  9. First, to avoid the poor detections obtained with MViTv2-S out of the box, the final pooling layer is removed to increase the resolution of the output tokens; the central tokens are then used for the detection task, and temporal pooling forms an equivalent set of spatio-temporal tokens for the classification task. An alternative strategy is also studied: the bounding boxes and actor predictions are taken from the tokens corresponding to \(t = \lfloor T/2 \rfloor\) and \(t = \lceil T/2 \rceil\), which are concatenated into 512 tokens for the actor detection task. Temporal pooling is then applied to the past and future tokens with respect to the middle frame, and the two pooled sets are concatenated to form the final 512 action tokens. Note that the actor and action tokens must remain in one-to-one correspondence, which is a necessary condition for producing the \(\hat{L}\) triplets.
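A sketch of this alternative pairing under the same assumed `(B, T, H, W, D)` layout; the exact frame indexing for \(\lfloor T/2 \rfloor\) and \(\lceil T/2 \rceil\) and the past/future split are assumptions based on the description above (an even number of output frames is assumed).

```python
import torch

def select_tokens_alt(tokens):
    """Alternative strategy: concatenate the two central frames for detection and
    the temporally pooled past/future halves for action classification, so both
    heads see 2*H*W tokens in one-to-one correspondence."""
    B, T, H, W, D = tokens.shape
    lo, hi = T // 2 - 1, T // 2                              # two frames around the midpoint
    det = torch.cat([tokens[:, lo], tokens[:, hi]], dim=1)   # (B, 2H, W, D)
    past = tokens[:, : T // 2].mean(dim=1)                   # pool frames before the midpoint
    future = tokens[:, T // 2 :].mean(dim=1)                 # pool frames after the midpoint
    act = torch.cat([past, future], dim=1)                   # (B, 2H, W, D)
    return det.reshape(B, 2 * H * W, D), act.reshape(B, 2 * H * W, D)
```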

  10. Conclusion: the output tokens of a Vision Transformer can be forwarded independently to corresponding MLP heads to produce a fixed-size set of predictions, similar to DETR, and the backbone can be trained directly with a bipartite matching loss. It is possible to perform both tasks without sacrificing performance, and the results show that this simple model achieves accuracy comparable to two-stage methods.

\(Fig.1^{[1]}\) Comparison between existing works and our proposed approach. (a) Traditional two-stage methods work on developing strong vision transformers that are applied in the domain of Action Localization by outsourcing the bounding box detections to an external detector. ROI Align is applied to the output of the transformer using the detected bounding boxes, and the pooled features are forwarded to an MLP that returns the class predictions. (b) Recent approaches in one-stage Action Localization leverage the DETR capacity to model both the bounding boxes and the action classes. A video backbone produces strong spatio-temporal features that are handled by a DETR transformer encoder. A set of learnable queries are then used by a DETR transformer decoder to produce the final outputs. (c) Our method builds a vision transformer only that is trained against a bipartite matching loss between the individual predictions given by the output spatio-temporal tokens and the ground-truth bounding boxes and classes. Our method does not need learnable queries, as well as a DETR decoder, and can combine the backbone and the DETR encoder into a single architecture.

\(Fig.2^{[1]}\) The output spatio-temporal tokens are fed to 3 parallel heads. We use the central tokens to predict the bounding box and the actor likelihood while averaging the output tokens over the temporal axis to generate the action tokens. Each head comprises a small MLP that generates the output triplets. We depict the flow diagram for each head, following the standard OWL-ViT head.