WOO

Watch Only Once: An End-to-End Video Action Detection Framework [1]

The authors are Shoufa Chen, Peize Sun, Enze Xie, and others from Ping Luo's group at the University of Hong Kong. Reference [1]: Chen, Shoufa et al. "Watch Only Once: An End-to-End Video Action Detection Framework." 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 8158-8167.

Time

  • 2021.Oct

Key Words

  • end-to-end unified network
  • task-specific features

Summary

  1. The paper proposes an end-to-end pipeline for video action detection. Existing methods either decouple the task into two separate stages, actor localization and action classification, or train two separate models within a single stage. In contrast, this work handles actor localization and action classification in one network. By unifying the backbone network and removing many hand-crafted components, the whole pipeline is simplified. WOO uses a unified video backbone to extract features for both actor localization and action classification; in addition, it introduces spatial-temporal action embeddings and designs a spatial-temporal fusion module to obtain more informative and discriminative features, which improves action classification performance.
  2. Video action detection consists of actor bbox localization and action type classification. The complexity of current methods stems from a fundamental dilemma between the two: using a single key frame is "positive" for actor localization but "negative" for action classification, whereas using multiple frames has the opposite effect. This is because actor localization needs a 2D detection model to predict actor bboxes on the key frame of the video clip, and considering the neighboring frames of the clip at this stage brings extra computation and memory cost. In contrast, action classification relies heavily on a 3D video model to extract temporal information from the video sequence, and a single frame carries little temporal motion representation for action classification.

    Two possible alternatives have been proposed to resolve this dilemma. The first uses an offline person detector, which generates actor proposals and is not trained jointly with action classification; a separate video model then takes these actor proposals and the raw frames as input to predict action classes. The person detector alone is already complex: it is pre-trained on ImageNet and COCO human keypoint detection, and further fine-tuned on the target action detection dataset. This approach is complicated and computationally expensive, requiring two separate models and two training stages. Moreover, optimizing the two sub-problems separately leads to a sub-optimal solution.

    The second type of approach jointly trains the actor detection and action classification models in a single training stage. Although the training pipeline is simplified to some extent, the two models still extract features from the raw images independently, so the whole framework still has high computation and memory cost.

    A natural question is: is it possible to design a simple, unified network that solves actor localization and action classification in a single end-to-end model?

    This paper proposes the Watch Only Once (WOO) framework. WOO directly predicts actors' bboxes and action classes from a video clip; it needs to "watch" the clip only once to predict both actor locations and action categories. The method consists of three key components: a unified backbone, spatial-temporal action embeddings, and a spatial-temporal knowledge fusion mechanism.

    First, a simple and effective module is designed so that a single backbone can provide task-specific feature maps for the actor localization head and the action classification head. The module is lightweight and isolates keyframe features from all frames in the early stages of the backbone. The motivation is that the keyframe gets more and more interaction with neighboring frames as the model goes deeper. The proposed module can easily be plugged into existing backbones such as I3D and X3D.

    In addition, the authors note that the same architecture tends to behave well for actor localization but is limited for action classification, and the difficulty of action detection mainly lies in action classification. They therefore suspect that a single backbone serving both tasks would be biased towards localization and converge to an undesired solution, hurting action classification performance. Based on this observation, they propose spatial and temporal action embeddings and an interaction mechanism between them, making the action classification features more discriminative from both the spatial and temporal perspectives.

    Third, a spatial-temporal fusion module is proposed to aggregate spatial and temporal knowledge: spatial properties such as shape and pose, and temporal properties such as dynamic motion and the temporal scale of the action, are combined by the fusion module to generate action features for action classification.

    The main contributions are as follows:

    1. An end-to-end framework for video action detection: given a video clip as input, it directly produces bboxes and action classes, without the separate person detector that was indispensable in previous works.

    2. A spatial-temporal embedding and an embedding interaction mechanism that improve the discriminativeness of features for action classification, plus a spatial-temporal fusion module that further aggregates features from the spatial and temporal dimensions.

  3. Related work:

    • Two-stage, two-backbone: current SOTA models for spatio-temporal action detection (STAD) usually adopt a two-stage pipeline with two backbones. These methods simply split the STAD task into actor localization and action classification. Specifically, in the first stage a model is pre-trained on COCO keypoints and then fine-tuned on the target STAD dataset; in the second stage, the key frame of the video clip is fed into the detection model obtained in the first stage to predict actor bboxes. The video clip and the actor bboxes are then fed into a 3D backbone to extract RoI features for action class prediction. These methods have high complexity and low efficiency because of the sequential training stages and the separate model architectures. In addition, optimizing the two stages independently may lead to a sub-optimal result.

    • One-stage, two-backbone: YOWO and ACRN simplify the pipeline by training the 2D actor detection network and the 3D video model simultaneously. However, there are still two separate models to optimize. Taking YOWO as an example, it contains a 3D model pre-trained on Kinetics and a 2D YOLO model pre-trained on PASCAL VOC, which brings a high computation and memory burden even though the pipeline is simplified to some extent.

    Compared with these methods, WOO is refreshingly simple: given a video clip, it directly predicts actor bboxes and the corresponding action classes.

    • End-to-end object detection: recent end-to-end object detection frameworks output predictions directly, without hand-designed procedures such as NMS, and achieve strong performance. Among these works, DETR can be regarded as an end-to-end object detection method; it adopts a global attention mechanism and bipartite matching between predictions and ground truth objects. DETR discards the NMS step and achieves good performance, but it performs poorly on small objects and requires much longer training than mainstream detectors. To address these problems, Deformable-DETR restricts each object query to a small set of crucial sampling points around reference points instead of all points in the feature map, which makes it efficient and fast to converge. Sparse R-CNN uses a sparse set of learned object proposals and performs classification and localization in an iterative way; compared with well-established detectors, it shows competitive accuracy, inference speed, and training convergence. In this work, the Sparse R-CNN detection head is adopted for localization.

    • Attention mechanism for action recognition: the attention mechanism is a popular concept in language-related tasks. For action recognition, Non-local networks use self-attention to capture dependencies between features at different times or spatial positions. To make attention applicable to action detection, some works use the non-local block as a long-term feature bank operator, which allows video models to access long-term information and improves action detection performance.

  4. Methods: \(X \in \mathbb{R}^{C\times T\times H\times W}\) denotes the input spatio-temporal feature map of a layer. Following previous work, the key frame is placed in the middle of the video clip, and \(X_{t=\lfloor T/2\rfloor}\in \mathbb{R}^{C\times H\times W}\) denotes the keyframe feature map.

    • Union backbone: in previous video backbones, key frame features interact with neighboring frame features through temporal pooling or 3D convolutions (a temporal kernel size larger than 1 brings unintended disturbance to the key frame features). To overcome this, the video backbone is designed so that key frame features are isolated from temporal interaction in the early stages of the network.

    Unlike the earlier SlowFast backbone (which sets the spatial stride of res5 to 1 and uses a dilation of 2 for its filters to increase the spatial resolution of res5 by \(2 \times\)), the authors remove the dilated convolution in res5 and adopt an FPN module for keyframe feature extraction. The FPN module takes the keyframe features output by res2, res3, res4, and res5 as input. The FPN outputs are then used for actor localization, while the res5 output is used for action classification. In this way, a single unified action backbone provides task-relevant features.

    This design has several advantages. First, the actor localization head uses hierarchical feature representations as its source features, which benefits object detection. Second, through the FPN structure, the keyframe features used for actor localization are isolated from the features of all video frames starting at the early stage of the backbone; this reduces interference from neighboring frames, since the keyframe interacts more with neighboring frames as the model goes deeper. Third, compared with existing two-backbone networks that use a separate backbone for actor localization, only a lightweight FPN module that takes image features as input is added, reducing parameters and FLOPs.
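    As a rough illustration (not the authors' implementation), the sketch below shows how a single backbone could route features: 2D keyframe slices of the stage outputs (res2-res5) feed a lightweight FPN for actor localization, while the full res5 volume goes to the action classification head. The wrapper name, channel sizes, and the assumption that the backbone exposes per-stage outputs are illustrative only, and the sketch slices keyframe features at the stage outputs rather than implementing the early-stage isolation described above.

    ```python
    import torch
    import torch.nn as nn
    from torchvision.ops import FeaturePyramidNetwork

    class UnifiedBackboneSketch(nn.Module):
        """Hypothetical feature routing for a single shared backbone."""

        def __init__(self, video_backbone, in_channels=(256, 512, 1024, 2048), fpn_channels=256):
            super().__init__()
            self.video_backbone = video_backbone          # assumed to return {"res2": ..., ..., "res5": ...}
            self.fpn = FeaturePyramidNetwork(list(in_channels), fpn_channels)

        def forward(self, clip):                          # clip: (B, 3, T, H, W)
            feats = self.video_backbone(clip)             # each stage: (B, C_i, T, H_i, W_i)
            t_key = clip.shape[2] // 2                    # key frame sits in the middle of the clip
            # 2D keyframe features -> FPN -> hierarchical features for the actor localization head.
            key_feats = {name: f[:, :, t_key] for name, f in feats.items()}
            loc_feats = self.fpn(key_feats)
            # Full spatio-temporal res5 volume for the action classification head.
            cls_feats = feats["res5"]                     # (B, C, T, H', W')
            return loc_feats, cls_feats
    ```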

    • Actor Localization Head: inspired by the recent Sparse R-CNN, an end-to-end detection head is designed for actor localization. Given the hierarchical features from the FPN, the detection head predicts bboxes and corresponding scores indicating the model's confidence that each box contains an actor. In addition, the person detector uses a set prediction loss with optimal bipartite matching between predictions and ground truth at training time, so no post-processing is needed at inference time. Unlike two-backbone methods, no extra pre-training is required, because the person detector and the action classifier share one backbone.
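    To make the "optimal bipartite matching" concrete, here is a minimal, hedged sketch of Hungarian matching between predicted boxes and ground-truth actors. The cost below only combines a classification term and an L1 box term with made-up weights; the actual matching cost in set-prediction detectors such as Sparse R-CNN also includes a GIoU term.

    ```python
    import torch
    from scipy.optimize import linear_sum_assignment

    def hungarian_match(pred_logits, pred_boxes, gt_boxes, cls_weight=1.0, l1_weight=5.0):
        """Match N predictions to M ground-truth actor boxes (illustrative cost only).

        pred_logits: (N, 2) actor / no-actor logits
        pred_boxes:  (N, 4) predicted boxes, normalized (cx, cy, w, h)
        gt_boxes:    (M, 4) ground-truth boxes in the same format
        """
        prob_actor = pred_logits.softmax(-1)[:, 1]                 # P(actor) per prediction
        cost_cls = -prob_actor[:, None]                            # (N, 1): prefer confident boxes
        cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)          # (N, M): L1 distance
        cost = cls_weight * cost_cls + l1_weight * cost_box        # broadcast to (N, M)
        pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
        return pred_idx, gt_idx                                    # optimal one-to-one assignment
    ```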

    • Action Classification Head: given the \(N\) actor proposal boxes produced by the person detector, RoIAlign is used to extract spatial and temporal features for each box; the two types of features are then fused to obtain the final action class prediction. The details are as follows:

      1. Spatial Action Features: let \(X_{5} \in \mathbb{R}^{C\times T\times H\times W}\) denote the feature from res5. A global average pooling over the temporal dimension yields a spatial feature map \(f^{s} \in \mathbb{R}^{C\times1\times H\times W}\). Applying RoIAlign on \(f^s\) with the \(N\) actor proposals gives \(N\) spatial RoI features \(f_{1}^{s},f_{2}^{s},\cdots,f_{N}^{s} \in \mathbb{R}^{C\times S\times S}\), where \(S \times S\) is the spatial output size of RoIAlign.

      2. Temporal Action Features: besides spatial action features, temporal properties are also important. To capture temporal motion information, temporal features are extracted from every frame of the feature volume \(X_5\). Since the focus here is temporal information, a global average pooling over the spatial dimensions is used to obtain the temporal RoI features, denoted \(f_{1}^{t},f_{2}^{t},\cdots,f_{N}^{t}\in \mathbb{R}^{C\times T\times1\times1}\). (A sketch of both feature extractions follows below.)
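      A minimal sketch (under assumed shapes) of how the two kinds of RoI features above could be computed with `torchvision.ops.roi_align`; applying RoIAlign per frame for the temporal branch is an implementation assumption not spelled out in the text.

      ```python
      import torch
      from torchvision.ops import roi_align

      def action_roi_features(x5, boxes, out_size=7):
          """Sketch of spatial and temporal RoI feature extraction.

          x5:    (B, C, T, H, W) res5 feature volume
          boxes: list of length B, each (N_b, 4) in (x1, y1, x2, y2) at the x5 scale
          """
          B, C, T, H, W = x5.shape
          # Spatial branch: global average pool over time, then RoIAlign -> (sum N_b, C, S, S).
          f_s = roi_align(x5.mean(dim=2), boxes, output_size=out_size)
          # Temporal branch: RoIAlign on every frame, then global average pool over space.
          per_frame = []
          for t in range(T):
              r = roi_align(x5[:, :, t], boxes, output_size=out_size)   # (sum N_b, C, S, S)
              per_frame.append(r.mean(dim=(2, 3)))                      # (sum N_b, C)
          f_t = torch.stack(per_frame, dim=2)[..., None, None]          # (sum N_b, C, T, 1, 1)
          return f_s, f_t
      ```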

      3. Embedding Interaction: to obtain discriminative features and enhance instance-specific characteristics, spatial and temporal embeddings are introduced to be convolved with the aforementioned spatial and temporal features. The spatial embedding is expected to encode spatial properties such as shape and pose; the temporal embedding encodes temporal properties such as dynamics and the temporal scale of the action. Note that each of the \(N\) features has its own exclusive embedding: define \(E^{s}\in \mathbb{R}^{N\times d}\) and \(E^{t}\in \mathbb{R}^{N\times d}\) as the spatial and temporal embeddings, where \(E_n^s \in \mathbb{R}^d\) and \(E_n^t \in \mathbb{R}^d\) serve the \(n\)-th RoI feature. To capture the relations between different actors, an attention module is built over all RoI features. Since each actor RoI has its own spatial and temporal embedding, and embeddings are much lighter than feature maps, the attention mechanism is applied between embeddings rather than between feature maps for efficiency. Given a query element and a set of key elements, the multi-head attention module aggregates the key contents according to attention weights that adaptively measure the compatibility of query-key pairs. Formally, let \(x = (x_1,\ldots,x_n)\) denote the \(n\) input elements; the output \(z = (z_1,\ldots,z_n)\) is computed with each \(z_i\) a weighted sum of linearly transformed inputs:

      \[z_i=\sum_{j=1}^n\alpha_{ij}(x_jW^V).\]

      The weight coefficient \(\alpha_{ij}\) is computed by a softmax: \[\alpha_{ij}=\frac{\exp z_{ij}}{\sum_{k=1}^n\exp z_{ik}},\quad \text{where } z_{ij}=\frac{(x_iW^Q)(x_jW^K)^T}{\sqrt{d_z}}.\] \(E^s\) and \(E^t\) are fed into the self-attention module, producing outputs \(\phi^s\) and \(\phi^t\) with the same shapes as the original embeddings \(E^s\) and \(E^t\). The final action feature is \[f=\mathcal{G}(\mathcal{F}(f^s,\phi^s),\mathcal{F}(f^t,\phi^t)),\]

      where \(\mathcal{F}\) is a convolution operation with parameters \(\phi\) and \(\mathcal{G}\) is the spatio-temporal fusion operation. \(\mathcal{F}\) is instantiated with \(1 \times 1\) kernels for efficiency. (A sketch of the embedding interaction follows below.)
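      The following is a rough, hedged sketch of the embedding interaction for one branch (the spatial one): self-attention over the \(N\) per-RoI embeddings, then using the attended embedding as the parameters of a per-RoI \(1 \times 1\) convolution applied to that RoI's feature. Reading \(\mathcal{F}\) as a dynamic per-RoI convolution, and the specific reshaping into kernels, are assumptions rather than details given in the text.

      ```python
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class EmbeddingInteractionSketch(nn.Module):
          """Self-attention over per-RoI embeddings, then a per-RoI dynamic 1x1 conv."""

          def __init__(self, num_rois=100, feat_dim=256, embed_dim=256, num_heads=8):
              super().__init__()
              self.embed = nn.Parameter(torch.randn(num_rois, embed_dim))   # E: one embedding per RoI
              self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
              self.to_kernel = nn.Linear(embed_dim, feat_dim * feat_dim)    # phi -> 1x1 conv weights

          def forward(self, roi_feats):               # roi_feats: (N, C, S, S)
              n, c, s, _ = roi_feats.shape
              e = self.embed[:n].unsqueeze(0)                               # (1, N, d)
              phi, _ = self.attn(e, e, e)                                   # attended embeddings
              kernels = self.to_kernel(phi.squeeze(0)).view(n, c, c, 1, 1)  # per-RoI C x C x 1x1 kernels
              refined = [F.conv2d(roi_feats[i:i + 1], kernels[i]) for i in range(n)]
              return torch.cat(refined, dim=0)                              # (N, C, S, S)
      ```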

      4. Objective Function: the proposed model solves localization and classification in an end-to-end manner, so the overall objective consists of the two corresponding parts: \[\mathcal{L}=\underbrace{\lambda_{cls}\cdot\mathcal{L}_{cls}+\lambda_{L1}\cdot\mathcal{L}_{L1}+\lambda_{giou}\cdot\mathcal{L}_{giou}}_{\text{set prediction loss}}+\underbrace{\lambda_{act}\cdot\mathcal{L}_{act}}_{\text{action}}.\]

      The first part is the set prediction loss, which produces an optimal bipartite matching between predictions and ground truth objects. \(\mathcal{L}_{cls}\) is a cross-entropy loss over two classes (containing an actor vs. not containing an actor), and \(\mathcal{L}_{L1}\) and \(\mathcal{L}_{giou}\) are box losses; \(\lambda_{cls},\lambda_{L1},\lambda_{giou}\) are constants balancing the contributions of these losses. For the second part, \(\mathcal{L}_{act}\) is a binary cross-entropy loss used for action classification, and \(\lambda_{act}\) is the corresponding weight.
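      As a hedged sketch of how the total objective could be assembled on already-matched prediction/target pairs (the weight values are illustrative, not the paper's, and torchvision's `generalized_box_iou_loss` is used here as a stand-in for \(\mathcal{L}_{giou}\)):

      ```python
      import torch
      import torch.nn.functional as F
      from torchvision.ops import generalized_box_iou_loss

      def woo_loss(actor_logits, actor_labels, pred_boxes, gt_boxes,
                   action_logits, action_targets,
                   w_cls=2.0, w_l1=5.0, w_giou=2.0, w_act=1.0):
          """Sketch of the combined loss.

          actor_logits:   (N, 2) actor / no-actor logits for all predictions
          actor_labels:   (N,)   long tensor, 1 if a prediction is matched to a ground truth, else 0
          pred_boxes:     (M, 4) matched predicted boxes, (x1, y1, x2, y2)
          gt_boxes:       (M, 4) matched ground-truth boxes
          action_logits:  (M, K) multi-label action logits for matched actors
          action_targets: (M, K) binary action labels
          """
          # Set prediction part: actor classification + L1 + GIoU box losses.
          l_cls = F.cross_entropy(actor_logits, actor_labels)
          l_l1 = F.l1_loss(pred_boxes, gt_boxes)
          l_giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
          # Action part: binary cross-entropy over action classes.
          l_act = F.binary_cross_entropy_with_logits(action_logits, action_targets)
          return w_cls * l_cls + w_l1 * l_l1 + w_giou * l_giou + w_act * l_act
      ```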

  5. Experiments:

    • Spatial-temporal fusion: different instantiations of fusing the temporal and spatial action features are compared: summation, concatenation, and cross-attention (CA). The results show that CA works better than the other two (a sketch of the three variants follows below).
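    A rough sketch of the three fusion instantiations compared here (the feature shapes and the direction of the cross-attention are assumptions, since this summary gives no implementation details):

    ```python
    import torch
    import torch.nn as nn

    class FusionVariants(nn.Module):
        """Three ways to fuse spatial and temporal action features of shape (N, d)."""

        def __init__(self, dim=256, num_heads=8):
            super().__init__()
            self.proj = nn.Linear(2 * dim, dim)                             # for the concatenation variant
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, f_s, f_t, mode="ca"):                             # f_s, f_t: (N, d)
            if mode == "sum":                                               # element-wise summation
                return f_s + f_t
            if mode == "concat":                                            # concatenation + projection
                return self.proj(torch.cat([f_s, f_t], dim=-1))
            # Cross-attention: spatial features query the temporal features (assumed direction).
            out, _ = self.cross_attn(f_s.unsqueeze(0), f_t.unsqueeze(0), f_t.unsqueeze(0))
            return out.squeeze(0)
    ```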
Motivation of WOO

\(Figure \ 1^{[1]}\): Motivation of WOO. (a) Previous dominant video action detection methods usually adopt two separate networks: an independent 2D detection model for actor localization from every key frames, and a 3D video model for action classification from video clips. (b) Our end-to-end unified framework uses a single backbone network to handle both 2D image detection and 3D video classification (i.e.2D spatial dimensions plus a temporal dimension). This unified backbone only “watches” an input video once, and directly produces both actor localization and action classification

Comparison of Backbone

\(Figure \ 2^{[1]}\): Comparisons of backbone architecture. (a) Two separate backbones for actor localization and action classification. Video backbone adopts res5 stage with dilated convolution (DC5). (b) A single union backbone which can provide task-specific features for actor localization and action classification simultaneously, enabling nearly cost-free feature extraction for actor localization compared to (a). Key frame features are illustrated in light orange color. Here we purposely omit the res2 features for visual simplicity

Action classification head

\(Figure \ 3^{[1]}\): Action classification head. Given the RoI feature of a specific box for T frames, spatial and temporal action features are generated. Then, spatial and temporal embedding is used to make action feature representation more discriminative through the interaction module. Finally, the multi-layer perceptron (MLP) takes as input the fused spatial-temporal feature and predicts the action class logits. See text for details

Structure of interaction module

\(Figure \ 4^{[1]}\): Structure of interaction module. Here we plot spatial embedding interaction as an example. ‘⊗’ denotes matrix multiplication, and ‘⊛’ denotes 1 × 1 convolution