YOWO

Posted on 2024-08-24, updated on 2024-08-27, filed under Papers

You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization^[1]

The authors are Okan Köpüklü, Xiangyu Wei and Gerhard Rigoll from the Technical University of Munich. Paper citation [1]: Köpüklü, Okan et al. “You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization.” ArXiv abs/1911.06644 (2019): n. pag.

Time

2019.Nov.15(v1)
2021.Oct.18(v5)

Key Words

single-stage with two branches

Summary

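Existing networks extract the temporal information and the spatial information of the key frame with two separate networks, and then use an additional mechanism to fuse them into detections. YOWO is a single-stage architecture with two branches that extract temporal and spatial information concurrently and predict bounding boxes and action probabilities directly from video clips in one evaluation. Because the architecture is unified, it can be optimized end to end. YOWO is fast: it runs at 34 frames per second on 16-frame input clips and at 62 frames per second on 8-frame input clips. According to the paper, this made it the fastest architecture on the spatiotemporal action detection (STAD) task at the time.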

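Compared with object detection in static images, temporal information matters a great deal. Inspired by Faster R-CNN in object detection, state-of-the-art works extend the classic two-stage architecture to action detection: the first stage generates proposals, and the second stage performs classification and localization refinement. However, two-stage STAD has three main drawbacks. (1) An action tube consists of bounding boxes across frames, so generating it is more complex and time-consuming than in the 2D case. Classification performance depends heavily on these proposals, yet the detected bounding boxes may be sub-optimal for the subsequent classification task. (2) Action proposals focus only on the features of the people in the video and ignore the relationship between people and other attributes in the background, which can provide context that is quite important for action prediction. (3) The region proposal network (RPN) and the classification network are trained separately, which does not guarantee a global optimum; only a local optimum from the combination of the two stages can be found. Training is also more expensive than for single-stage networks, so it takes more time and memory.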
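YOWO overcomes the drawbacks above. Its intuitive idea comes from the human visual cognitive system: to understand the actions of a person in a video, one has to relate the information in the current frame (2D features from the key frame) to the knowledge retained in memory (3D features from the clip). The two kinds of features are then fused to reach a reasonable conclusion. YOWO is a single-stage network with two branches. One branch extracts the spatial features of the key frame via a 2D CNN, and the other models the spatiotemporal features of the clip consisting of previous frames via a 3D CNN. YOWO is a causal architecture: it does not use future frames, so it can operate online on incoming video streams. To aggregate the 2D CNN and 3D CNN features smoothly, a channel fusion and attention mechanism is used, which gets the most out of inter-channel dependencies. Finally, frame-level detections are produced from the fused features, and a linking algorithm generates action tubes.

YOWO is not limited to the RGB modality; other modalities such as optical flow can also be used, and any CNN architecture can be plugged in depending on the real-time performance requirement. YOWO operates on at most 16 input frames, because short clip lengths are necessary for real-time STAD. However, a small clip size limits how much temporal information can be accumulated. A long-term feature bank is therefore used: a pretrained 3D CNN extracts features for non-overlapping 8-frame clips over the whole video, and at inference time the 3D features centered on the key frame are averaged.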
The main contributions are as follows:
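- A single-stage STAD framework for video streams that can be trained end to end. Bounding box regression is carried out on features from both the 3D CNN and the 2D CNN, and the two kinds of features complement each other for the final bounding box regression and classification. A channel attention mechanism aggregates the features of the two branches. Experiments show that the channel-wise attention mechanism, which models inter-channel relationships within the concatenated feature maps, improves performance.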
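Related work: To take temporal information into account, two-stream CNNs extract spatial and temporal features separately and then aggregate them. Most of these works rely on optical flow, which is time-consuming and computationally expensive. 3D CNNs came next and extract spatiotemporal features directly. For resource efficiency, some works use a 2D CNN to learn 2D features and then a 3D CNN to fuse them and learn temporal features. Attention is an effective mechanism for capturing long-range dependencies and has been used in CNNs to improve image classification performance. Attention can be applied spatially or channel-wise: spatial attention addresses the inter-spatial relationship among features, while channel attention strengthens the most meaningful channels and weakens the others. As a channel-wise attention block, the Squeeze-and-Excitation module is beneficial for improving CNN performance. For video classification, on the other hand, the non-local block takes spatiotemporal information into account to learn feature dependencies across frames, and it can be viewed as a self-attention strategy. Unlike previous works, YOWO uses the clip only once and detects the corresponding actions in the key frame. To avoid the costly computation of optical flow, it uses the 2D features of the key frame and the 3D features of the clip. The two types of features are then fused with an attention mechanism so that rich contextual information is taken into account.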
The YOWO architecture consists of four main parts: the 3D CNN branch, the 2D CNN branch, CFAM, and the bounding box regression part.
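- 3D CNN: Because contextual information is important for understanding human actions, a 3D CNN is used to extract spatiotemporal features. A 3D CNN captures motion information by convolving over both the spatial and the temporal dimensions. The base 3D CNN architecture here is 3D-ResNeXt-101; for every 3D CNN architecture, all layers after the last convolutional layer are discarded. The input to the 3D network is a video clip made of temporally consecutive frames, with shape $C \times D \times H \times W$. The output feature map of the last 3D-ResNeXt-101 convolutional layer has shape $C' \times D' \times H' \times W'$, where $C=3$, $D$ is the number of input frames, $D'=1$, $H' = \frac{H}{32}$ and $W' = \frac{W}{32}$. The depth dimension of the output feature map is reduced to 1 so that the output volume can be squeezed to $C' \times H' \times W'$ and matches the output of the 2D CNN.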
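- 2D CNN: To address the spatial localization problem, the 2D features of the key frame are extracted in parallel. Darknet-19 is used as the base 2D CNN architecture because it balances accuracy and efficiency. The key frame, with shape $C \times H \times W$, is the most recent frame of the input clip, so no additional data loader is needed. The output feature map of Darknet-19 has shape $C'' \times H' \times W'$, where $C=3$, $C''$ is the number of output channels, and $H'= \frac{H}{32}$, $W' = \frac{W}{32}$, as in the 3D CNN case. Another important property of YOWO is that its 2D CNN and 3D CNN branches can be replaced by arbitrary CNN architectures, which makes it more flexible. Note that although YOWO has two branches, it is a unified architecture that can be trained end to end.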
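- Feature aggregation: Channel Fusion and Attention Mechanism (CFAM): The outputs of the 3D and 2D networks have the same shape in the last two dimensions, so the two feature maps can be fused easily. Concatenation is used to fuse them, so the fused feature map encodes both motion and appearance information. The fused feature is then fed into the CFAM module, which is based on the Gram matrix to map inter-channel dependencies. Although the Gram matrix was originally used for style transfer and more recently in segmentation tasks, such an attention mechanism is beneficial for fusing features that come from different sources, and it improves performance. The concatenated feature map $A \in R^{(C' + C'') \times H' \times W'}$ can be regarded as an abrupt combination of 3D and 2D information that ignores the interrelationship between them. $A$ is therefore fed to two conv layers to produce a new feature map $B \in R^{C \times H' \times W'}$ (in the CFAM part, $C$ denotes the number of channels of $B$, not the number of input image channels), and several operations are then applied to $B$. $F \in R^{C \times N}$ is the tensor obtained by reshaping $B$, where $N=H' \times W'$, meaning that the features in each channel are vectorized to one dimension. A matrix multiplication between $F \in R^{C \times N}$ and its transpose $F^T \in R^{N \times C}$ then yields the Gram matrix $G \in R^{C \times C}$, which indicates the correlations between channels.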
\[\mathbf{G}=\mathbf{F}\cdot\mathbf{F}^\mathrm{T}\quad\text{with}\quad G_{ij}=\sum_{k=1}^{N}F_{ik}\cdot F_{jk}\]

Each element $G_{ij}$ of the Gram matrix $G$ is the inner product between the vectorized feature maps $i$ and $j$.
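As a minimal NumPy sketch of the formula above, the reshape and the Gram matrix can be computed as follows. The function name `gram_matrix`, the toy sizes and the random feature map are made up for illustration and are not taken from the paper.

```python
import numpy as np


def gram_matrix(feature_map: np.ndarray) -> np.ndarray:
    """Return the C x C Gram matrix of a feature map with shape (C, H, W)."""
    channels = feature_map.shape[0]
    # Vectorize every channel: (C, H, W) -> (C, N) with N = H * W.
    flat = feature_map.reshape(channels, -1)
    # G[i, j] is the inner product of flattened channels i and j.
    return flat @ flat.T


# Toy sizes chosen only for the example.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 7, 7))
G = gram_matrix(B)
```

The result is symmetric, and its diagonal holds the squared norm of each flattened channel.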

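After the Gram matrix is computed, a softmax layer produces the channel attention map $M \in R^{C \times C}$, where $M_{ij}$ measures the impact of the $j^{th}$ channel on the $i^{th}$ channel. $M$ therefore summarizes the inter-channel dependencies of the features in the given feature map. To apply the attention map to the original features, $M$ is multiplied by $F$ and the result is reshaped back to the three-dimensional space $R^{C \times H' \times W'}$, which has the same shape as the input tensor.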

\[\begin{aligned}\mathbf{F}^{\prime} &= \mathbf{M}\cdot\mathbf{F}\\ \mathbf{F}^{\prime}\in\mathbb{R}^{C\times N} &\xrightarrow{\text{reshape}} \mathbf{F}^{\prime\prime}\in\mathbb{R}^{C\times H^{\prime}\times W^{\prime}}\end{aligned}\]

The output of the channel attention module, $\mathbf{C} \in R^{C \times H' \times W'}$, combines $F''$ with the original input feature map $B$ through an element-wise sum weighted by a trainable scalar parameter $\alpha$, which starts at 0 and gradually learns a weight. \[\mathbf{C} = \alpha\cdot\mathbf{F}^{\prime\prime}+\mathbf{B}\]

This equation shows that the final feature of each channel is a weighted sum of the features of all channels and the original features, which models the long-range semantic dependencies between feature maps. Finally, $\mathbf{C}\in\mathbb{R}^{C\times H^{\prime}\times W^{\prime}}$ is fed into two conv layers to produce the output feature map of the CFAM module, $\mathbf{D}\in\mathbb{R}^{C^{*}\times H^{\prime}\times W^{\prime}}$. The two conv layers at the beginning and at the end of the CFAM module are important because they mix the features coming from different distributions; without them, the performance gain from CFAM is limited.

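Such an architecture promotes feature representativeness in terms of inter-dependencies among channels, so the features from the different branches can be aggregated reasonably and smoothly. In addition, the Gram matrix takes the whole feature map into account. The dot product of two flattened feature vectors expresses the relation between them: a larger product indicates that the features in the two channels are more correlated, while a smaller product indicates that they differ from each other. For a given channel, more weight is allocated to the other channels that are strongly correlated with it and have more impact on it. Through this mechanism, contextual relationships are emphasized and feature discriminability is enhanced.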
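
The attention step described above (Gram matrix, softmax, multiplication with $F$, reshape, and the weighted sum with $B$) can be sketched in NumPy as below. This is only an illustration under stated assumptions: the softmax is taken row-wise so that each row of $M$ sums to 1, the conv layers before and after the attention step are left out, and the function name `channel_attention` and the toy sizes are hypothetical.

```python
import numpy as np


def channel_attention(B: np.ndarray, alpha: float) -> np.ndarray:
    """Gram-matrix channel attention on a feature map B of shape (C, H, W)."""
    C, H, W = B.shape
    F = B.reshape(C, H * W)               # vectorize each channel: (C, N)
    G = F @ F.T                           # Gram matrix: (C, C)
    # Row-wise softmax (assumed axis). Subtracting the row maximum only
    # improves numerical stability and does not change the result.
    E = np.exp(G - G.max(axis=1, keepdims=True))
    M = E / E.sum(axis=1, keepdims=True)  # attention map, each row sums to 1
    F_att = (M @ F).reshape(C, H, W)      # F' = M . F, reshaped to F''
    return alpha * F_att + B              # alpha * F'' + B


rng = np.random.default_rng(0)
B = rng.standard_normal((6, 7, 7))        # toy sizes, not from the paper
out_init = channel_attention(B, alpha=0.0)
out_later = channel_attention(B, alpha=0.5)
```

With `alpha=0.0` the module returns `B` unchanged, which is why a scalar that starts at 0 lets the network begin from the plain fused features and bring in the attention term gradually.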
- Bounding box regression: YOWO follows the same guidelines as YOLO for bounding box regression. A final conv layer with $1 \times 1$ kernels produces the desired number of output channels. For each grid cell in $H' \times W'$, 5 prior anchors are selected with the k-means method on the corresponding datasets, each with NumCls class-conditional action scores, 4 coordinates and a confidence score. The final output of YOWO therefore has size $[(5\times(NumCls+5))\times H'\times W']$. The bounding box regression is then refined based on these anchors.
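
As a quick arithmetic check of the output size, the small helper below evaluates the formula for a given number of classes. The function name and its parameters are hypothetical; the defaults assume the $224 \times 224$ input resolution and the downsampling factor of 32 mentioned in this post, and the example assumes the 21 action classes implied by the name J-HMDB-21.

```python
def yowo_output_shape(num_classes: int, input_size: int = 224,
                      num_anchors: int = 5, stride: int = 32) -> tuple:
    """Return (channels, H', W') of the final prediction tensor."""
    # Per anchor: NumCls class scores + 4 box coordinates + 1 confidence score.
    channels = num_anchors * (num_classes + 5)
    grid = input_size // stride  # H' = H / 32 and W' = W / 32
    return channels, grid, grid


print(yowo_output_shape(21))  # (130, 7, 7)
```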
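The input resolution is $224 \times 224$ in both training and testing. Multi-scale training with different resolutions did not improve performance in the experiments. The loss function is similar to that of the original YOLOv2 network, except that a smooth L1 loss with beta=1 is used for localization: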

\[L_{1,smooth}(x,y)=\begin{cases}0.5(x-y)^2 & \text{if } |x-y|<1\\ |x-y|-0.5 & \text{otherwise}\end{cases}\]

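Here $x$ and $y$ are the prediction and the ground truth, respectively. Compared with the MSE loss, the smooth L1 loss is less sensitive to outliers and can prevent exploding gradients in some cases. The MSE loss is used for the confidence scores: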

\[L_{MSE}(x,y)=(x-y)^2\]
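
To make the outlier behaviour concrete, here is a small plain-Python sketch of the two losses. The general `beta` form is an assumption borrowed from common deep learning libraries; with `beta=1.0` it reduces to the piecewise formula above.

```python
def smooth_l1(x: float, y: float, beta: float = 1.0) -> float:
    """Smooth L1 loss: quadratic for small errors, linear for large ones."""
    diff = abs(x - y)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def mse(x: float, y: float) -> float:
    """Squared error, as used for the confidence scores."""
    return (x - y) ** 2


small = (smooth_l1(0.5, 0.0), mse(0.5, 0.0))   # (0.125, 0.25)
large = (smooth_l1(4.0, 0.0), mse(4.0, 0.0))   # (3.5, 16.0)
```

For the large error the squared loss is more than four times the smooth L1 value, which is the sense in which smooth L1 is less sensitive to outliers.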

The final detection loss is the sum of the individual coordinate losses for x, y, width and height and the confidence score loss:

\[L_D=L_x+L_y+L_w+L_h+L_{conf}\]

The focal loss is used for classification:

\[L_{focal}(x,y)=-\left[y(1-x)^\gamma \log(x)+(1-y)x^\gamma \log(1-x)\right]\]

$x$ is the softmaxed network prediction and $y \in \{0,1\}$ is the ground-truth class label. $\gamma$ is the modulating factor, which reduces the loss of samples with high confidence (easy samples) and so puts relatively more weight on samples with low confidence (hard samples). AVA is a multi-label dataset in which each person performs one pose action and possibly multiple human-human or human-object interaction actions. Softmax is therefore applied to the pose classes and sigmoid to the interaction actions. AVA is also an imbalanced dataset, and the modulating factor $\gamma$ alone is not enough to handle the imbalance, so the $\alpha$-balanced variant of the focal loss is used, with $\alpha$ set to the exponential of the class sample ratios. The final loss used by YOWO is the sum of the detection loss and the classification loss:

\[L_{final}=\lambda L_D+L_{Cls}\]

Here, $\lambda = 0.5$ performed best in the experiments.
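
The classification and final losses can be sketched in plain Python as below. Several things here are assumptions rather than details from the paper: the sketch uses the conventional sign so that the loss is non-negative, `gamma=2.0` is a value commonly used in the focal loss literature, `alpha` is a plain per-class weight supplied by the caller (the exponential of class sample ratios is not reproduced), and `eps` only guards against `log(0)`.

```python
import math


def focal_loss(x: float, y: int, gamma: float = 2.0,
               alpha: float = 1.0, eps: float = 1e-7) -> float:
    """Focal loss for one predicted probability x and a label y in {0, 1}."""
    x = min(max(x, eps), 1.0 - eps)  # keep both logarithms finite
    pos = y * (1.0 - x) ** gamma * math.log(x)
    neg = (1 - y) * x ** gamma * math.log(1.0 - x)
    return -alpha * (pos + neg)


def final_loss(detection_loss: float, cls_loss: float, lam: float = 0.5) -> float:
    """L_final = lambda * L_D + L_Cls."""
    return lam * detection_loss + cls_loss


easy = focal_loss(0.9, 1)  # confident and correct: strongly down-weighted
hard = focal_loss(0.1, 1)  # confident and wrong: keeps most of its loss
```

Setting `gamma=0.0` removes the modulating factor and leaves ordinary cross-entropy, which is a convenient sanity check.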
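The 3D and 2D CNNs are initialized separately: the 3D CNN with models pretrained on Kinetics and the 2D CNN with models pretrained on PASCAL VOC. Although the architecture consists of a 2D CNN and a 3D CNN, their parameters can be updated jointly. Mini-batch SGD with momentum and weight decay is used to optimize the loss function, with an initial learning rate of 0.0001. Because J-HMDB-21 has few samples, all 3D conv network parameters are frozen when training on it, which speeds up convergence and reduces the risk of overfitting. Data augmentation such as flipping and random scaling is also applied during training.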
Linking strategy: Once frame-level action detections are available, the next step is to link the detected bounding boxes into action tubes over the whole video. A linking algorithm is used to find the optimal video-level action detections. Suppose $R_t$ and $R_{t+1}$ are two regions from consecutive frames $t$ and $t+1$. The linking score for an action class $c$ is defined as: \[\begin{aligned}s_{c}(R_{t},R_{t+1})&=\quad\psi(ov)\cdot[s_{c}(R_{t})+s_{c}(R_{t+1})\\&+\alpha\cdot s_{c}(R_{t})\cdot s_{c}(R_{t+1})\\&+\beta\cdot ov(R_{t},R_{t+1})]\end{aligned}\]

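$s_{c}(R_{t})$ and $s_{c}(R_{t+1})$ are the class-specific scores of regions $R_t$ and $R_{t+1}$, $ov$ is the IoU of the two regions, and $\alpha$ and $\beta$ are scalar weights. $\psi(ov)$ is 1 if the two regions overlap and 0 otherwise. Compared with the usual definition of the linking score, an extra term $\alpha\cdot s_{c}(R_{t})\cdot s_{c}(R_{t+1})$ is added; it takes dramatic changes of the scores between two consecutive frames into account and improves video-level detection performance. After all linking scores are computed, the Viterbi algorithm finds the optimal path and generates the action tubes.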
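
The linking step can be sketched in plain Python as follows. This is a simplified illustration for a single action class and a single best path, not the authors' implementation: the values `alpha=0.5` and `beta=1.0` are placeholders, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names and toy detections are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def linking_score(det_a, det_b, alpha=0.5, beta=1.0):
    """Linking score of two detections, each given as (box, class_score)."""
    (box_a, s_a), (box_b, s_b) = det_a, det_b
    ov = iou(box_a, box_b)
    psi = 1.0 if ov > 0 else 0.0  # zero score when the boxes do not overlap
    return psi * (s_a + s_b + alpha * s_a * s_b + beta * ov)


def best_tube(frames, alpha=0.5, beta=1.0):
    """Viterbi search: index of the chosen detection in every frame."""
    best = [0.0] * len(frames[0])  # best accumulated score ending at each detection
    back = []                      # back-pointers, one list per frame transition
    for prev, cur in zip(frames, frames[1:]):
        new_best, pointers = [], []
        for det in cur:
            cands = [best[i] + linking_score(p, det, alpha, beta)
                     for i, p in enumerate(prev)]
            i_max = max(range(len(cands)), key=cands.__getitem__)
            new_best.append(cands[i_max])
            pointers.append(i_max)
        best = new_best
        back.append(pointers)
    j = max(range(len(best)), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):  # walk the back-pointers to the first frame
        j = pointers[j]
        path.append(j)
    return path[::-1]


frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)],
    [((52, 50, 62, 60), 0.7), ((1, 0, 11, 10), 0.85)],
    [((2, 0, 12, 10), 0.9)],
]
print(best_tube(frames))  # [0, 1, 0]
```

In the toy example the path follows the box that drifts slowly from the top-left corner, because only those detections overlap across all three frames.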
Long-term feature bank: Although YOWO's inference is online and causal with a small clip size, the 16-frame input limits the temporal information required for action understanding. A long-term feature bank (LFB) is therefore used, which holds 3D CNN features for different timestamps. LFB features are extracted for non-overlapping 8-frame clips with the pretrained 3D ResNeXt-101 backbone. At inference time, the 8 features centered on the key frame are averaged and the resulting feature map is used as input to the CFAM block, so a total of 64 frames is used at inference. The LFB improves action classification performance, similar to the difference between clip accuracy and video accuracy on video datasets. However, the LFB makes the architecture non-causal, because future 3D features are used at inference time.
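
The averaging step can be illustrated with a short NumPy sketch. It assumes the bank is stored as one array with a feature map per non-overlapping 8-frame clip. The function name, the exact centering of an even-sized window and the clamping at the video boundaries are assumptions, since the post does not specify them.

```python
import numpy as np


def lfb_feature(bank: np.ndarray, key_index: int, window: int = 8) -> np.ndarray:
    """Average `window` bank entries around the clip that holds the key frame.

    bank has shape (T, C, H, W): one 3D-CNN feature map per 8-frame clip.
    """
    start = key_index - window // 2
    # Clamp so the window stays inside the video (assumed boundary handling).
    start = max(0, min(start, len(bank) - window))
    return bank[start:start + window].mean(axis=0)


# Toy bank of 12 clips whose features simply equal the clip index.
bank = np.arange(12.0).reshape(12, 1, 1, 1) * np.ones((1, 2, 3, 3))
feat = lfb_feature(bank, key_index=6)  # averages clips 2..9
```

With 8 entries of 8 frames each, the averaged feature covers 64 frames, and roughly half of them lie after the key frame, which is what makes this variant non-causal.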

$Figure \ \ 1^{[1]}$: The YOWO architecture. An input clip and the corresponding key frame are fed to a 3D CNN and a 2D CNN to produce output feature volumes of $[C' \times H' \times W']$ and $[C'' \times H' \times W']$, respectively. These output volumes are fed to the channel fusion and attention mechanism (CFAM) for smooth feature aggregation. Finally, one last conv layer adjusts the channel number for the final bounding box predictions.

$Figure \ \ 2^{[1]}$: Channel fusion and attention mechanism for aggregating output feature maps coming from 2D-CNN and 3D-CNN branches
