YOWOv3

YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition[1]

The authors are Nguyen Dang Duc Manh, Duong Viet Hang, et al. Reference [1]: Dang, Duc M. et al. “YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition.” (2024).

Time

  • 2024.Aug

Key Words

  • one-stage detector
  • different configurations to customize different model components
  • efficient while reducing computational resource requirements

Summary

  1. YOWOv3 is an enhanced version of YOWOv2 that offers more approaches, using different configurations to customize individual model components; YOWOv3 outperforms YOWOv2.
  2. STAD (spatio-temporal action detection) is a common task in computer vision. It involves detecting the location (bbox), timing (exact frames), and type (class) of an action, which requires modeling both spatial and temporal features. Many methods have been proposed for STAD, e.g. ViT-based ones; ViT performs well but is computationally expensive. For example, the Hiera model has more than 600M parameters and VideoMAEv2 has more than 1B, which increases training cost and resource consumption. To solve STAD while minimizing training and inference cost, the YOWO method was proposed; although it achieves real time, it has limitations: it is not an efficient model with low computational requirements, and the framework's authors have stopped maintaining it while many issues remain. The contributions of this paper are as follows:
    • new lightweight framework for STAD
    • efficient model
    • multiple pretrained resources for applications: a range of pretrained resources, spanning from lightweight to sophisticated models, to cater to the diverse requirements of real-world applications.
  3. YOWO's architecture is outdated; it lacks the sophistication and advancements seen in contemporary models, which limits its applicability and performance. YOWOv2 builds on YOWO with anchor-free object detection and an FPN, improving performance, but it also increases GFLOPs, which conflicts with the goal of an efficient, lightweight model.

  4. Framework: YOWOv3 adopts a two-stream network with two processing streams. The first extracts spatial information and context from the image using a 2D CNN; the second, a 3D CNN, extracts temporal information and motion. The outputs of the two streams are combined to obtain features containing both the spatial and temporal information of the video. Finally, a CNN layer makes predictions based on these extracted features (a minimal end-to-end sketch follows this list).

    • Spatial Feature Extractor: the model needs a spatial feature extractor to provide location information. For this purpose, the YOLOv8 model is adopted with its detection layer removed. The input to this module is a feature map of size \([3,H,W]\), representing the final frame of the input video. By exploiting a pyramid network architecture, the output consists of feature maps at 3 different levels: \(F_{lv1}:[{C}_{2D},\frac{H}{8},\frac{W}{8}]\), \(F_{lv2}:[{C_{2D}},\frac{H}{16},\frac{W}{16}]\), and \(F_{lv3}:[C_{2D},\frac{H}{32},\frac{W}{32}]\).
    • Decoupled head: the decoupled head separates the classification and regression tasks. The YOLOX team found that, in earlier models, using a single feature map for both classification and regression made training more challenging. A similar approach is therefore used here: two independent CNN streams, one per task, to enhance the model's comprehension.

    \[\begin{aligned}F_{cls}&=Conv_{cls2}(Conv_{cls1}(x))\\F_{box}&=Conv_{box2}(Conv_{box1}(x))\end{aligned}\]

    The output of the 2D backbone consists of 3 feature maps at different levels. Each feature map is fed into the Decoupled Head to produce two feature maps, one for classification and one for regression. The input to the Decoupled Head is a tensor \(F_{lv} : [C_{2D},H_{lv},W_{lv}]\), and it outputs two tensors of the same shape: \([C_{inter},H_{lv},W_{lv}]\).

    • Temporal motion feature extractor: to improve the accuracy of action-label prediction, a 3D CNN is used, specifically the I3D model. The input to the 3D backbone is a tensor \(F_{3D}:[3, D, H,W]\) representing the entire clip, and the output is a tensor \(F_{3D}:[C_{3D},1,\frac{H}{32},\frac{W}{32}]\).

    • Fusion Head: the fusion head integrates the features from the 2D CNN and 3D CNN streams. Its input consists of two tensors: \(F_{lv}:[C_{inter},H_{lv},W_{lv}]\) and \(F_{3D} : [C_{3D},1,\frac{H}{32},\frac{W}{32}]\). First, \(F_{3D}\) is squeezed to shape \([C_{3D},\frac{H}{32},\frac{W}{32}]\), then upscaled to match the dimensions \(H_{lv}\) and \(W_{lv}\). Next, \(F_{3D}\) and \(F_{lv}\) are concatenated to obtain a tensor \(F_{concat}\). Finally, \(F_{concat}\) is fed into the CFAM module, an attention mechanism. The output of the CFAM module is a feature map \(F_{final}:[C_{inter},H_{lv},W_{lv}]\).

    • Detection Head: it has been pointed out that forcing the model to predict bboxes as a Dirac delta distribution makes training difficult, so the model is instead made to learn a more general distribution rather than simply regressing to a single value. To reduce the model's dependence on selecting hyperparameters for predefined bboxes as in previous studies, an anchor-free design is also used. The input to the Detection Head consists of two tensors, \(F_{cls}\) and \(F_{box}\), for the classification and regression tasks respectively. The final predictions are produced by one more convolution: \[Predict_{cls}=Conv(F_{cls})=Conv(Conv_{cls2}(Conv_{cls1}(x)))\\Predict_{box}=Conv(F_{box})=Conv(Conv_{box2}(Conv_{box1}(x)))\]
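To make the tensor flow above concrete, here is a minimal PyTorch sketch of one pyramid level of this two-stream design. The exact wiring (fusing each decoupled stream with the 3D feature before the final conv), the 1x1-conv stand-in for CFAM, and all channel sizes are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledHead(nn.Module):
    """Two independent conv streams: one for classification, one for box regression."""
    def __init__(self, c2d, c_inter):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(c2d, c_inter, 3, padding=1), nn.SiLU(),
                nn.Conv2d(c_inter, c_inter, 3, padding=1), nn.SiLU(),
            )
        self.cls_branch, self.box_branch = branch(), branch()

    def forward(self, f_lv):                       # f_lv: [B, C2D, H_lv, W_lv]
        return self.cls_branch(f_lv), self.box_branch(f_lv)

class FusionHead(nn.Module):
    """Squeeze the 3D feature, upscale it to H_lv x W_lv, concatenate with the
    2D feature, then apply attention (a 1x1 conv stands in for the real CFAM)."""
    def __init__(self, c_inter, c3d):
        super().__init__()
        self.cfam = nn.Conv2d(c_inter + c3d, c_inter, 1)   # CFAM placeholder

    def forward(self, f_lv, f_3d):                 # f_3d: [B, C3D, 1, H/32, W/32]
        f_3d = f_3d.squeeze(2)                     # -> [B, C3D, H/32, W/32]
        f_3d = F.interpolate(f_3d, size=f_lv.shape[-2:], mode="nearest")
        return self.cfam(torch.cat([f_lv, f_3d], dim=1))   # [B, C_inter, H_lv, W_lv]

# Shape check for level 1 (stride 8); all sizes are made up for illustration.
B, H, W, C2D, C3D, C_INTER, NUM_CLS = 2, 224, 224, 256, 512, 128, 24
head, fuse = DecoupledHead(C2D, C_INTER), FusionHead(C_INTER, C3D)
pred_cls = nn.Conv2d(C_INTER, NUM_CLS, 1)          # final conv of the Detection Head
f_lv1 = torch.randn(B, C2D, H // 8, W // 8)        # from the 2D backbone (YOLOv8, detect layer removed)
f_3d = torch.randn(B, C3D, 1, H // 32, W // 32)    # from the 3D backbone (I3D) over the whole clip
f_cls, _f_box = head(f_lv1)
print(pred_cls(fuse(f_cls, f_3d)).shape)           # torch.Size([2, 24, 28, 28])
```

The other two pyramid levels (strides 16 and 32) would reuse the same heads on their own \(F_{lv}\).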

  5. Label Assignment: two different label assignment mechanisms are used to match the model's predictions with the ground truth labels from the data: TAL and SimOTA, where SimOTA is a simplified version of OTA. Both mechanisms rely on a similarity measure between \(d_{pred}\) and \(d_{truth}\) to perform the matching between them.

    • TAL: the similarity measure between a prediction \(d_{pred} \in A\) and a ground truth \(d_{truth} \in T\) is defined as: \[\begin{aligned} metric &= cls\_err^{\alpha}\,box\_err^{\beta} \\ cls\_err &= BCE(cls_{pred},cls_{truth}) \\ box\_err &= CIoU(box_{pred},box_{truth}) \end{aligned}\]

    Each \(d_{truth} \in T\) is matched with the \(top_k\) \(d_{pred}\) that have the highest \(metric\), but each \(d_{pred}\) can be matched with at most one \(d_{truth}\). If a \(d_{pred}\) is in the \(top_k\) of multiple \(d_{truth}\), it is matched with the \(d_{truth}\) that has the highest \(CIoU(box_{pred},box_{truth})\). In addition, only those \(d_{pred}\) are considered whose receptive-field center falls inside the box of \(d_{truth}\) and whose distance from the center of the box of \(d_{truth}\) does not exceed a radius.

    If \(d_{pred}\) is matched with \(d_{truth}\), the target of \(d_{pred}\) is taken to be \(d_{truth}\): the target probability for the corresponding class is set to 1 if it appears, and 0 otherwise.

    • SimOTA: similar to TAL, SimOTA uses a similarity measure for matching between \(d_{pred}\) and \(d_{truth}\): \[\begin{aligned}metric&=BCE(\lambda\,cls_{pred},cls_{truth})-\alpha\log(\lambda)\\\lambda&=CIoU(box_{pred},box_{truth})\end{aligned}\]

    Unlike TAL, here each \(d_{truth}\) is matched only with the \(top_k\) \(d_{pred}\) that have the smallest metric. SimOTA does not fix the \(top_k\) value; instead it uses a procedure to estimate \(top_k\) for each \(d_{truth}\). Experiments show that the dynamic \(top_k\) slows down training and adds extra computational overhead (a matching sketch follows).
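To make the matching concrete, here is a minimal sketch of the TAL-style top-k assignment described above, operating on precomputed `metric` and `ciou` matrices. The names and the YOLOX-style dynamic-\(top_k\) helper (the slower alternative the text mentions) are illustrative assumptions, not the authors' code:

```python
import torch

def tal_assign(metric, ciou, top_k=10):
    """metric, ciou: [num_truth, num_pred] similarity matrices. Returns the
    assigned ground-truth index per prediction (-1 = unmatched). Filtering of
    candidates by receptive-field center (inside the gt box, within a radius)
    is assumed to have been applied to `metric` beforehand."""
    num_truth, num_pred = metric.shape
    k = min(top_k, num_pred)
    # Each ground truth proposes its k predictions with the highest metric.
    topk_idx = metric.topk(k, dim=1).indices                # [num_truth, k]
    candidate = torch.zeros_like(metric, dtype=torch.bool)
    candidate.scatter_(1, topk_idx, True)
    # A prediction claimed by several ground truths keeps the one with highest CIoU.
    masked_ciou = torch.where(candidate, ciou, torch.full_like(ciou, -2.0))
    best_truth = masked_ciou.argmax(dim=0)                  # per prediction
    matched = candidate.any(dim=0)
    return torch.where(matched, best_truth, torch.full_like(best_truth, -1))

def dynamic_k(ciou, max_candidates=10):
    """YOLOX-style dynamic top_k per ground truth: sum the highest CIoUs
    and clamp to at least 1."""
    k = min(max_candidates, ciou.shape[1])
    return ciou.topk(k, dim=1).values.sum(dim=1).int().clamp(min=1)
```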

  6. Loss Function: two loss functions correspond to the two label assignment mechanisms. The total loss consists of two parts: \[\mathcal{L}=\mathcal{L}_{box}+\mathcal{L}_{cls}\]

One part is for bbox regression and the other is for label classification; each loss has several subcomponents. For \(d_{pred} \in N\) (unmatched predictions), \(\mathcal{L}_{box}=0\) and \(cls_{truth}\) is identically 0.

  • TAL: the loss function is: \[\mathcal{L}=\frac{\mathcal{L}_{box}+\mathcal{L}_{cls}}{\omega}\]

\[\begin{aligned}\mathcal{L}_{box}&=\delta(d_{truth})\,(\alpha\, CIoU(d_{pred},d_{truth})+\beta\,\mathcal{L}_{distribution})\\\mathcal{L}_{cls}&=\gamma\, BCE(cls_{pred},cls_{truth})\end{aligned}\]

\(\mathcal{L}_{distribution}\) denotes the Distribution Loss function.

\[\begin{aligned}\delta(d_{i})&=\sum_{p_{j}\in cls_{i}}p_{j}\\\omega&=\sum_{d_{pred}\in\mathcal{A}}\sum_{d_{i}\in\mathcal{M}(d_{pred})}\delta(d_{i})\end{aligned}\]

\(\alpha,\beta,\gamma\) are hyperparameters used to scale the components of the loss \(\mathcal{L}\); they are fixed at \(\alpha=7.5\), \(\beta=1.5\), \(\gamma=0.5\). A sketch of this weighted normalization follows.
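A compact sketch of assembling the TAL loss above, assuming the per-prediction CIoU and distribution losses are already computed; `delta` and `omega` follow the formulas, and all names are illustrative:

```python
import torch

def tal_loss(ciou_loss, dfl_loss, bce_cls, cls_truth, matched,
             alpha=7.5, beta=1.5, gamma=0.5):
    """ciou_loss, dfl_loss: [num_pred] per-prediction box losses;
    bce_cls: [num_pred] BCE already summed over classes;
    cls_truth: [num_pred, num_classes] targets; matched: [num_pred] bool."""
    delta = cls_truth.sum(dim=1)                    # delta(d_i): sum of target class probs
    omega = delta[matched].sum().clamp(min=1e-9)    # normalizer over matched predictions
    l_box = (delta * (alpha * ciou_loss + beta * dfl_loss))[matched].sum()
    l_cls = gamma * bce_cls.sum()                   # unmatched preds have all-zero targets
    return (l_box + l_cls) / omega
```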

  • SimOTA: the loss function is: \[\mathcal{L}=\frac{\mathcal{L}_{box}+\mathcal{L}_{cls}}{|\mathcal{P}|}\]

\[\begin{aligned}\mathcal{L}_{box}&=\alpha\, CIoU(d_{pred},d_{truth})+\beta\,\mathcal{L}_{distribution}\\\mathcal{L}_{cls}&=\exp(cls_{t})\,|cls_{truth}-cls_{pred}|^{\nu}\, BCE(cls_{pred},cls_{truth})\end{aligned}\]

\(\mathcal{L}_{cls}\) is a generalized focal loss function, multiplied by a class balancing factor \(\exp(cls_t)\):

\[p_i\in cls_t=\begin{cases}class_{ratio},&p_{truth}\neq0\\1-class_{ratio},&p_{truth}=0\end{cases}\]

\(class_{ratio}\) is a class balancing factor: classes with lower frequencies get a higher \(class_{ratio}\). Here \(\alpha=5.5\), \(\beta=0.5\), \(\gamma=0.5\), \(\nu = 0.5\) (a sketch of this class-balanced focal term follows).
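A minimal sketch of this class-balanced generalized focal term, assuming sigmoid logits, soft targets, and a per-class `class_ratio` vector (higher for rarer classes); all names are illustrative:

```python
import torch
import torch.nn.functional as F

def balanced_gfl(cls_pred, cls_truth, class_ratio, nu=0.5):
    """cls_pred: [num_pred, C] logits; cls_truth: [num_pred, C] soft targets
    in [0, 1]; class_ratio: [C], higher for rarer classes."""
    p = cls_pred.sigmoid()
    bce = F.binary_cross_entropy_with_logits(cls_pred, cls_truth, reduction="none")
    # cls_t picks class_ratio where the target is nonzero, 1 - class_ratio otherwise.
    cls_t = torch.where(cls_truth != 0, class_ratio, 1.0 - class_ratio)
    return (torch.exp(cls_t) * (cls_truth - p).abs() ** nu * bce).sum()
```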

  7. Experiments: since AVAv2.2 is an extremely imbalanced dataset, additional measures are needed to reduce the impact of this imbalance and raise the overall mAP. The two methods used for this are soft labels (qualified loss) and the inclusion of a class balance term. For frequently occurring common classes, \(class_{ratio}\) is close to 0.5 and the class balance term \(\exp(cls_t)\) stays nearly constant, meaning the loss does not change much on a wrong prediction; for rarely occurring classes, \(class_{ratio}\) is higher, and \(\exp(cls_t)\) penalizes the model heavily when it fails to predict them correctly. This creates a bias that helps improve predictions for less common classes. In addition, soft labels are used to reduce the effect of model overconfidence, especially on very frequent classes.
    • For one-to-many label assignment, one ground truth box is matched with multiple predicted boxes; the number of matched boxes can be estimated dynamically or predetermined. Experiments show that a fixed \(k\) produces good results, while automatically estimating \(top_k\) adds an extra 6% of training time. These results favor treating \(top_k\) as a hyperparameter rather than as something to optimize.
    • A model variant with Exponential Moving Average (EMA) weights is kept in order to evaluate the impact of EMA. The results show that EMA strongly affects model performance in the initial epochs and only slightly in later epochs, indicating that EMA helps the model converge quickly in the early stage and improves the mAP score later in training (a minimal EMA sketch follows this list).
  8. After comparing the figures of YOWOv2 and YOWOv3, they look essentially the same, just drawn differently; it is unclear whether only the backbone was swapped. Check the code for the details.
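For reference, a minimal sketch of the EMA weight tracking mentioned above, assuming a plain PyTorch model; the decay value is illustrative, not taken from the paper:

```python
import copy
import torch

class ModelEMA:
    """Keeps an exponential moving average of a model's weights; the EMA copy
    is used for evaluation while the raw model keeps training."""
    def __init__(self, model, decay=0.999):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # Blend each floating-point weight toward the current model's weights.
        for ema_p, p in zip(self.ema.state_dict().values(), model.state_dict().values()):
            if ema_p.dtype.is_floating_point:
                ema_p.mul_(self.decay).add_(p.detach(), alpha=1.0 - self.decay)
```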

Structure \(figure 1^{[1]}\): overview architecture of YOWOv3

CFAM \(figure 2^{[1]}\): overview of the CFAM module