YOWOv2

YOWOv2: A Stronger yet Efficient Multi-level Detection Framework for Real-time STAD[1]

The authors are Jianhua Yang and Kun Dai from Harbin Institute of Technology. Reference [1]: Yang, Jianhua and Kun Dai. "YOWOv2: A Stronger yet Efficient Multi-level Detection Framework for Real-time Spatio-temporal Action Detection." arXiv abs/2302.06848 (2023).

Time

  • 2023.Feb

Key Words

  • combines 2D CNNs of different sizes with a 3D CNN
  • anchor-free mechanism
  • dynamic label assignment
  • multi-level detection structure

Summary

  1. YOWOv2 exploits the advantages of both a 3D backbone and a 2D backbone for accurate action detection. A multi-level detection pipeline is designed to detect action instances of different scales. To this end, a simple and efficient 2D backbone with an FPN is built to extract classification features and regression features at different levels. For the 3D backbone, an existing 3D CNN is adopted. By combining the 3D CNN with 2D CNNs of different sizes, the YOWOv2 family is designed, including YOWOv2-Tiny, YOWOv2-Medium, and YOWOv2-Large. A dynamic label assignment strategy and an anchor-free mechanism are also introduced to keep YOWOv2 in line with modern detector designs. YOWOv2 clearly outperforms YOWO while preserving real-time detection.
  1. In earlier work, 3D CNNs were computationally expensive and therefore poor for real-time use. Some methods instead use a parameter-shared 2D CNN to extract spatial features frame by frame and store them in a buffer; after that, only the new input frame is processed, and its spatial feature is combined with the features in the buffer to form spatio-temporal features for the final detection. However, such a pipeline cannot model temporal association well, and real-time detection is only achieved with RGB streams alone: once optical flow is added, performance improves but speed drops considerably. The previous work YOWO performs well, but it has two shortcomings.

    1. YOWO is a one-level detector that performs the final detection on a single low-level feature map, which hurts performance on small action instances.
    2. YOWO is an anchor-based method with many anchor boxes and many hyperparameters, such as the number, size, and aspect ratio of the anchor boxes. These hyperparameters must be tuned carefully, which limits generalization.

    Overall, designing a real-time detection framework for the spatio-temporal action detection task remains a challenge. This paper proposes a new real-time action detector, YOWOv2, which consists of a 3D backbone and a multi-level 2D backbone. Thanks to the multi-level 2D backbone with an FPN, YOWOv2 has a multi-level detection pipeline to detect action instances of different scales. For the 3D backbone, an efficient 3D CNN is used. In addition, an anchor-free mechanism avoids the drawbacks of anchor boxes; since anchor boxes are removed, a dynamic label assignment strategy is adopted, which further improves the versatility of YOWOv2. By combining the 3D backbone with 2D backbones of different sizes, several YOWOv2 models are built, including YOWOv2-Tiny, YOWOv2-Medium, and YOWOv2-Large, for platforms with different computing power.

    Compared with YOWO, YOWOv2 achieves better performance. Moreover, YOWOv2 runs in real time and outperforms other real-time action detectors. The contributions are as follows:

    • YOWOv2 uses a multi-level detection structure to detect small action instances.
    • YOWOv2 adopts an anchor-free detection pipeline.
    • By combining the 3D backbone with 2D backbones of different sizes, a family of YOWOv2 models is built for platforms with different computing power.
  2. Related work: STAD requires an action detector to locate and identify all instances in the current frame. How to extract spatio-temporal features is crucial for accurate action detection.

    • 3D CNN-based: some researchers use 3D CNNs to design action detectors. Girdhar et al. use I3D to generate action region proposals and then a Transformer to complete the final detection; others use a 3D CNN to encode the input video and then a Transformer with tubelet queries for the final detection. Although 3D CNN-based methods are successful, they require a large amount of computation and are poor for real-time use.
    • 2D CNN-based: another line of work decouples the spatio-temporal associations and designs 2D CNN-based action detectors for efficient detection. The Action Tubelet detector (ACT) is a one-stage detection framework: it first uses SSD to extract spatial features from every frame of a video clip and stacks them, then uses a detection head to process the stacked spatial features for the final detection. Following ACT, MovingCenter (MOC) was designed as an anchor-free one-stage action detector, and later work enhanced MOC with self-attention. However, these methods only run in real time when the input is RGB; once optical flow is added, performance improves but speed drops. Moreover, high-quality optical flow has to be computed offline, which is unsuitable for online operation.
  3. Methodology: given a video clip of \(K\) frames \(V = \{I_{1},I_{2},\ldots,I_{K}\}\), where \(I_K\) is the current frame, YOWOv2 uses an efficient 3D CNN as the 3D backbone to extract spatio-temporal features \(F_{ST} \in \mathbb{R}^{\frac{H}{32}\times\frac{W}{32}\times C_{o_{2}}}\). The 2D backbone of YOWOv2 is a multi-level 2D CNN that outputs decoupled multi-level spatial features \(F_{cls} = \{F_{cls_i}\}_{i=1}^{3}\) and \(F_{reg} = \{F_{reg_i}\}_{i=1}^3\) of \(I_K\), where \(F_{cls_{i}}\in\mathbb{R}^{\frac{H}{2^{i+2}}\times\frac{W}{2^{i+2}}\times C_{o_{1}}}\) are the classification features and \(F_{reg_{i}} \in \mathbb{R}^{\frac{H}{2^{i+2}}\times\frac{W}{2^{i+2}}\times C_{o_{1}}}\) are the regression features. After the two backbones, two channel encoders are applied on each level of feature map to integrate the features. After that, two additional parallel branches, each with two \(3 \times 3\) conv layers, follow the channel encoders to predict \(Y_{cls_{i}}\in \mathbb{R}^{\frac{H}{2^{i+2}}\times\frac{W}{2^{i+2}}\times N_{C}}\) for classification and the corresponding regression outputs, and a confidence branch is added for the actionness confidence. The overall tensor shapes are sketched below.
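
    A minimal shape walkthrough of the pipeline just described, assuming H = W = 224, K = 16, and hypothetical channel widths; the names here are placeholders, not the authors' code.

    ```python
    import torch

    K, H, W = 16, 224, 224            # clip length and frame size (assumed values)
    C_o1, C_o2, N_c = 256, 512, 24    # assumed channel widths and number of action classes

    clip = torch.randn(1, 3, K, H, W)   # video clip V = {I_1, ..., I_K}
    key_frame = clip[:, :, -1]           # current frame I_K, shape (1, 3, H, W)

    # 3D backbone -> spatio-temporal features F_ST with shape (1, C_o2, H/32, W/32)
    print("F_ST:", (1, C_o2, H // 32, W // 32))

    # 2D backbone + FPN -> decoupled multi-level features F_cls_i / F_reg_i of the key frame
    for i in (1, 2, 3):
        s = 2 ** (i + 2)                 # strides 8, 16, 32
        print(f"level {i}: F_cls_{i} / F_reg_{i} shape =", (1, C_o1, H // s, W // s))
    ```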

    • Design of YOWOv2
    1. 2D backbone: the 2D backbone extracts the multi-level spatial features of the current frame. To balance performance and speed, the design borrows ideas from modern object detectors: the backbone and FPN of YOLOv7 are reused to save time. After the FPN, an extra \(1 \times 1\) conv layer compresses the channel number of each level feature map \(F_{S_i}\) to \(C_{o_1}\), which is set to 256 by default. Two parallel branches, each with two \(3 \times 3\) conv layers, are then added to output decoupled features:

    \[\begin{aligned}F_{cls_{i}}&=f_{conv_{2}}^{1}\left(f_{conv_{1}}^{1}\left(F_{S_{i}}\right)\right)\\F_{reg_{i}}&=f_{conv_{2}}^{2}\left(f_{conv_{1}}^{2}\left(F_{S_{i}}\right)\right)\end{aligned}\]

    where \(f^{i}_{conv_{j}}\) is the \(j\)-th \(3 \times 3\) conv layer of the \(i\)-th branch.

    In the YOWOv2 framework, the 2D backbone outputs three levels of decoupled feature maps, \(F_{cls} = \{F_{cls_{i}}\}_{i=1}^{3}\) and \(F_{reg} = \{F_{reg_{i}}\}_{i=1}^{3}\). For convenience, this 2D backbone is called FreeYOLO. By controlling the depth and width of FreeYOLO, two FreeYOLOs of different sizes are designed: FreeYOLO-Tiny for YOWOv2-Tiny, and FreeYOLO-Large for YOWOv2-Medium and YOWOv2-Large. To speed up training, the 2D backbone, together with the extra \(1 \times 1\) conv layers, is pretrained on COCO. A minimal sketch of the decoupled branches is given below.
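
    A minimal PyTorch sketch of one FPN level of the decoupled 2D branches (the extra 1×1 compression conv plus the two parallel branches of the equation above). The input channel count and the activation between the 3×3 convs are assumptions; only the structure follows the text.

    ```python
    import torch
    import torch.nn as nn

    class DecoupledBranches(nn.Module):
        """One FPN level: 1x1 channel compression, then two parallel 3x3-conv branches."""
        def __init__(self, in_channels: int, c_o1: int = 256):
            super().__init__()
            self.reduce = nn.Conv2d(in_channels, c_o1, kernel_size=1)   # compress F_S_i to C_o1
            def branch():
                return nn.Sequential(
                    nn.Conv2d(c_o1, c_o1, 3, padding=1), nn.SiLU(),      # f^i_conv1 (SiLU assumed)
                    nn.Conv2d(c_o1, c_o1, 3, padding=1), nn.SiLU(),      # f^i_conv2
                )
            self.cls_branch, self.reg_branch = branch(), branch()

        def forward(self, f_s):             # f_s: FPN feature F_S_i, (B, in_channels, H_i, W_i)
            f_s = self.reduce(f_s)
            return self.cls_branch(f_s), self.reg_branch(f_s)   # F_cls_i, F_reg_i

    # e.g. level i = 1 (stride 8) of a 224x224 frame
    f_cls_1, f_reg_1 = DecoupledBranches(512)(torch.randn(1, 512, 28, 28))
    ```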

    2. 3D backbone: the 3D backbone extracts spatio-temporal features \(F_{ST}\) from the video clip for spatio-temporal association. An efficient 3D CNN is adopted to reduce computation and keep detection real-time. To fuse with the decoupled spatial features, \(F_{ST}\) is simply upsampled to obtain \(\{F_{ST_i}\}_{i=1}^{3}\), as follows:

    \[\begin{aligned}&F_{ST_{1}}=\mathrm{Upsample}_{4\times}\left(F_{ST}\right)\\&F_{ST_{2}}=\mathrm{Upsample}_{2\times}\left(F_{ST}\right)\\&F_{ST_{3}}=F_{ST}\end{aligned}\]

    The upsampling operations spatially align \(F_{ST_i} \in \mathbb{R}^{\frac{H}{2^{i+2}}\times\frac{W}{2^{i+2}}\times C_{o_2}}\) with \(F_{cls_i}\) and \(F_{reg_i}\); a minimal sketch of this alignment follows.
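
    A minimal sketch of the alignment above, assuming nearest-neighbor interpolation (the text only says "upsample") and channels-first PyTorch tensors.

    ```python
    import torch
    import torch.nn.functional as F

    def upsample_st_features(f_st: torch.Tensor) -> list[torch.Tensor]:
        """f_st: (B, C_o2, H/32, W/32) spatio-temporal features from the 3D backbone."""
        f_st_1 = F.interpolate(f_st, scale_factor=4, mode="nearest")   # -> H/8  x W/8
        f_st_2 = F.interpolate(f_st, scale_factor=2, mode="nearest")   # -> H/16 x W/16
        f_st_3 = f_st                                                   # already H/32 x W/32
        return [f_st_1, f_st_2, f_st_3]

    f_st_levels = upsample_st_features(torch.randn(1, 512, 7, 7))      # 7 = 224 / 32
    ```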

    • ChannelEncoder: YOWO proposed the Channel Encoder to fuse the features from the 2D and 3D backbones. Given \(F_{S} \in \mathbb{R}^{H_{o}\times W_{o}\times C_{o_{1}}}\) and \(F_{ST} \in \mathbb{R}^{H_{o}\times W_{o}\times C_{o_{2}}}\), the Channel Encoder first concatenates them along the channel dimension and applies two conv layers, each followed by a BN and a LeakyReLU, to perform the main channel integration: \[F_f=f_{conv_2}\left(f_{conv_1}\left(\text{Concat}\left[F_S,F_{ST}\right]\right)\right)\]

    where \(F_{f} \in \mathbb{R}^{H_{o}\times W_{o}\times C_{o_{3}}}\) and Concat is a channel concatenation operation. Then \(F_f\) is reshaped into \(F_{f_{2}}\in\mathbb{R}^{C_{o_{3}}\times H_{o}W_{o}}\) for the subsequent self-attention mechanism inspired by DANet, so that the features from different levels can be fully integrated:

    \[F_{f_{3}}=\text{Softmax}\left(F_{f_{2}}F_{f_{2}}^{T}\right)F_{f_{2}}\]

    Finally, \(F_{f_{3}} \in \mathbb{R}^{C_{o_{3}} \times H_{o}W_{o}}\) is reshaped back to \(\mathbb{R}^{H_{o}\times W_{o}\times C_{o_{3}}}\) and passed through another conv layer. A minimal sketch of the ChannelEncoder is given below.
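
    A minimal PyTorch sketch of the ChannelEncoder: channel concatenation, two conv layers each with BN and LeakyReLU, the DANet-style channel self-attention of the equation above, and a final conv layer. Kernel sizes and the value of C_o3 are assumptions.

    ```python
    import torch
    import torch.nn as nn

    class ChannelEncoder(nn.Module):
        def __init__(self, c_o1: int, c_o2: int, c_o3: int):
            super().__init__()
            # channel fusion: Concat -> two conv/BN/LeakyReLU blocks (kernel sizes assumed)
            self.fuse = nn.Sequential(
                nn.Conv2d(c_o1 + c_o2, c_o3, kernel_size=1),
                nn.BatchNorm2d(c_o3), nn.LeakyReLU(0.1),
                nn.Conv2d(c_o3, c_o3, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_o3), nn.LeakyReLU(0.1),
            )
            self.out_conv = nn.Conv2d(c_o3, c_o3, kernel_size=1)

        def forward(self, f_s, f_st):
            b, _, h, w = f_s.shape
            f_f = self.fuse(torch.cat([f_s, f_st], dim=1))           # F_f:  (B, C_o3, H_o, W_o)
            f_f2 = f_f.flatten(2)                                     # F_f2: (B, C_o3, H_o*W_o)
            attn = torch.softmax(f_f2 @ f_f2.transpose(1, 2), -1)     # Softmax(F_f2 F_f2^T)
            f_f3 = (attn @ f_f2).view(b, -1, h, w)                    # back to (B, C_o3, H_o, W_o)
            return self.out_conv(f_f3)

    fused = ChannelEncoder(256, 512, 256)(torch.randn(2, 256, 28, 28), torch.randn(2, 512, 28, 28))
    ```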

    • Decoupled fusion head: in YOWOv2, the 2D backbone outputs decoupled spatial features of the current frame, while the 3D backbone outputs the upsampled \(F_{ST}\) of the video clip. Note that \(F_{cls_{i}}\) and \(F_{reg_{i}}\) contain different semantic information, so they need to be fused with \(F_{ST_{i}}\) separately. A decoupled fusion head is therefore designed to fuse \(F_{ST_{i}}\) into \(F_{cls_{i}}\) and \(F_{reg_{i}}\) independently:

    \[\begin{aligned}F_{cls_{i}}^{f}&=\text{ChannelEncoder}(F_{cls_{i}},F_{ST_{i}})\\F_{reg_{i}}^{f}&=\text{ChannelEncoder}(F_{reg_{i}},F_{ST_{i}})\end{aligned}\]

    After this feature aggregation, two parallel branches on each level perform the final detection. The design is simple: one classification branch and one box regression branch.

    The classification branch outputs the classification prediction \(Y_{cls_{i}}\), which gives the probability of action instances at every spatial position of \(Y_{cls_{i}}\); \(N_c\) is the number of action classes. Taking \(F^f_{cls_{i}}\) as input, the branch applies two \(3 \times 3\) conv layers, each with \(C\) filters and followed by a SiLU activation. Finally, a conv layer with \(N_c\) filters and sigmoid activation outputs \(N_c\) binary predictions per spatial position.

    The box regression branch outputs the box regression prediction \(Y_{reg_{i}}\), which represents the 4 relative offsets at each spatial position. In addition, an extra \(1 \times 1\) conv layer with 1 filter is added to this branch for the actionness confidence prediction \(Y_{conf_{i}}\). Note that there are no anchor boxes at any spatial position: YOWOv2 is anchor-free. A minimal sketch of the decoupled fusion head and the prediction branches follows.
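
    A minimal sketch of one level of the decoupled fusion head and its prediction branches, reusing the ChannelEncoder sketch above. Channel widths, the placement of the confidence conv, and the shared regression trunk are assumptions.

    ```python
    import torch
    import torch.nn as nn

    class DecoupledFusionHead(nn.Module):
        def __init__(self, c_o1: int, c_o2: int, c_o3: int, num_classes: int):
            super().__init__()
            # fuse F_ST_i separately into the classification and regression features
            self.cls_encoder = ChannelEncoder(c_o1, c_o2, c_o3)   # from the sketch above
            self.reg_encoder = ChannelEncoder(c_o1, c_o2, c_o3)
            def trunk():
                return nn.Sequential(
                    nn.Conv2d(c_o3, c_o3, 3, padding=1), nn.SiLU(),
                    nn.Conv2d(c_o3, c_o3, 3, padding=1), nn.SiLU(),
                )
            self.cls_trunk, self.reg_trunk = trunk(), trunk()
            self.cls_pred = nn.Conv2d(c_o3, num_classes, 1)        # N_c logits (sigmoid in the loss)
            self.reg_pred = nn.Conv2d(c_o3, 4, 1)                  # 4 relative offsets per position
            self.conf_pred = nn.Conv2d(c_o3, 1, 1)                 # actionness confidence

        def forward(self, f_cls_i, f_reg_i, f_st_i):
            f_cls = self.cls_trunk(self.cls_encoder(f_cls_i, f_st_i))   # from F^f_cls_i
            f_reg = self.reg_trunk(self.reg_encoder(f_reg_i, f_st_i))   # from F^f_reg_i
            return self.cls_pred(f_cls), self.reg_pred(f_reg), self.conf_pred(f_reg)
    ```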

    • Label assignment: since YOWOv2 is an anchor-free action detector without any anchor boxes, multi-level label assignment matters. Recently, dynamic label assignment has been very successful in object detection. Inspired by YOLOX, SimOTA is used for the label assignment of YOWOv2. Specifically, the cost between all predicted bounding boxes and the ground truths is computed; each ground truth is assigned the \(top_k\) predicted boxes with the least cost, where \(k\) is determined by the IoU between the predicted boxes and the target boxes:

    \[c_{ij}\left(\hat{a}_i,a_j,\hat{b}_i,b_j\right)=L_{cls}(\hat{a}_i,a_j)+\gamma L_{reg}(\hat{b}_i,b_j)\]

    where \(\hat{a}_i\) and \(a_j\) are the classification prediction and target, \(\hat{b}_i\) and \(b_j\) are the regression prediction and target, and \(\gamma\) is the cost balancing coefficient. A rough sketch of this assignment follows.
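
    A rough sketch of the SimOTA-style assignment just described, following YOLOX conventions: the IoU-based regression cost and the top-10-IoU rule for picking a dynamic k are assumptions borrowed from SimOTA, not stated in the text.

    ```python
    import torch
    import torch.nn.functional as F
    from torchvision.ops import box_iou

    def simota_assign(pred_cls, pred_boxes, gt_labels, gt_boxes, gamma=3.0):
        """pred_cls: (N, N_c) probabilities; pred_boxes: (N, 4) and gt_boxes: (M, 4), xyxy."""
        ious = box_iou(gt_boxes, pred_boxes)                           # (M, N)
        # classification cost: BCE between predicted scores and the one-hot gt class
        gt_onehot = F.one_hot(gt_labels, pred_cls.shape[1]).float()    # (M, N_c)
        cls_cost = F.binary_cross_entropy(
            pred_cls.unsqueeze(0).expand(len(gt_boxes), -1, -1),
            gt_onehot.unsqueeze(1).expand(-1, len(pred_boxes), -1),
            reduction="none").sum(-1)                                   # (M, N)
        reg_cost = -torch.log(ious + 1e-8)                              # IoU-based regression cost
        cost = cls_cost + gamma * reg_cost                              # c_ij
        assignments = {}
        for j in range(len(gt_boxes)):
            # dynamic k: roughly the sum of the top-10 IoUs for this ground truth
            k = max(1, int(ious[j].topk(min(10, len(pred_boxes)))[0].sum().item()))
            assignments[j] = cost[j].topk(k, largest=False).indices     # k predictions with least cost
        return assignments
    ```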

    • Loss function\[\begin{aligned} L(\{a_{x,y}\},\{b_{x,y}\},\{c_{x,y}\})& =\frac{1}{N_{pos}}\sum_{x,y}L_{conf}(\hat{c}_{x,y},c_{x,y}) \\ &+ \frac{1}{N_{pos}}\sum_{x,y}\mathbb{I}_{\{\hat{a}_{x,y}>0\}}L_{cls}(\hat{a}_{x,y},a_{x,y}) \\ &+\frac{\lambda}{N_{pos}}\sum_{x,y}\mathbb{I}_{\{\hat{a}_{x,y}>0\}}L_{reg}(\hat{b}_{x,y},b_{x,y}) \end{aligned}\]

    \(L_{cls}\) is the binary cross-entropy loss and \(L_{reg}\) is the GIoU loss. \(a_{x,y}\), \(b_{x,y}\), and \(c_{x,y}\) are the classification, regression, and confidence predictions, respectively, while \(\hat{a}_{x,y}\), \(\hat{b}_{x,y}\), and \(\hat{c}_{x,y}\) are the ground truths. \(\mathbb{I}_{\{\hat{a}_{x,y}>0\}}\) is an indicator function that equals 1 when \(\hat{a}_{x,y} > 0\) and 0 otherwise. \(N_{pos}\) is the number of positive predictions, and \(\lambda\) is a loss balancing coefficient, set to 5 in the experiments. A minimal sketch of this loss follows.
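
    A minimal sketch of the loss above, assuming logits for the confidence and classification outputs, xyxy boxes for the GIoU term, and a torchvision version that provides generalized_box_iou_loss; tensor layouts are illustrative.

    ```python
    import torch
    import torch.nn.functional as F
    from torchvision.ops import generalized_box_iou_loss

    def yowov2_loss(pred_conf, pred_cls, pred_box, gt_conf, gt_cls, gt_box, pos_mask, lam=5.0):
        """Predictions/targets flattened over all levels and spatial positions (x, y)."""
        n_pos = pos_mask.sum().clamp(min=1)
        # confidence loss over all positions
        conf_loss = F.binary_cross_entropy_with_logits(pred_conf, gt_conf, reduction="sum")
        # classification loss (binary cross-entropy) only on positive positions
        cls_loss = F.binary_cross_entropy_with_logits(pred_cls[pos_mask], gt_cls[pos_mask], reduction="sum")
        # GIoU regression loss only on positive positions, weighted by lambda = 5
        reg_loss = generalized_box_iou_loss(pred_box[pos_mask], gt_box[pos_mask], reduction="sum")
        return (conf_loss + cls_loss + lam * reg_loss) / n_pos
    ```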

  4. Experiments: the experiments show that decoupled feature fusion is necessary because the semantic information of the classification and regression features differs, and that both 3D CNN-based and optical-flow-based methods incur too much computation to guarantee real-time performance.

Structure \(Fig. \ 1^{[1]}\). Overview of YOWOv2. YOWOv2 uses upsampling operations to align the spatio-temporal features output by the 3D backbone with the spatial features of each level output by the 2D backbone, and uses the decoupled fusion head to fuse the two features on each level. Finally, YOWOv2 outputs the multi-level confidence predictions, classification predictions, and regression predictions respectively.

channel encoder \(Fig. \ 2^{[1]}\). Overview of ChannelEncoder. It contains the channel fusion and channel self-attention mechanism, which are both used to fuse 2D and 3D features.

Coupled fusion head \(Fig. \ 3^{[1]}\). Coupled fusion head. In the coupled head, the spatial features from the 2D backbone are also coupled, which means that the parallel 3 × 3 conv layers after the FPN are removed.