YOWOv3

YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition[1]

The authors are Nguyen Dang Duc Manh, Duong Viet Hang, et al. Citation [1]: Dang, Duc M. et al. "YOWOv3: An Efficient and Generalized Framework for Human Action Detection and Recognition." (2024).

Time

  • 2024.Aug

Key Words

  • one-stage detector
  • different configurations to customize different model components
  • efficient while reducing computational resource requirements

Summary

  1. YOWOv3 is an enhanced version of YOWOv2: it offers more approaches and uses different configurations to customize different model components, and it outperforms YOWOv2.
  2. Spatio-temporal action detection (STAD) is a common task in computer vision. It involves detecting the location (bounding box), timing (exact frames), and type (action class) of human actions, which requires modeling both spatial and temporal features. Many methods address STAD, for example ViT-based models: they perform well but are computationally expensive. The Hiera model has over 600M parameters and VideoMAEv2 over 1B, which raises training cost and resource consumption. To solve STAD while minimizing training and inference cost, the YOWO approach was proposed; it runs in real time but has limitations: it is not an efficient model with low computational requirements, and its authors have stopped maintaining the framework while many issues remain. The contributions of this paper are as follows:
    • new lightweight framework for STAD
    • efficient model
    • multiple pretrained resources for application: creating a range of pretrained resources spanning from lightweight to sophisticated models to cater to diverse requirements of real-world applications.
  3. YOWO's architecture is outdated and lacks the sophistication and advancements seen in contemporary models, limiting its applicability and performance. YOWOv2 builds on YOWO with anchor-free object detection and an FPN, improving performance, but it also increases GFLOPs, which conflicts with the goal of an efficient, lightweight model.

  4. Framework: YOWOv3 adopts a two-stream network with two processing streams. The first extracts spatial information and context from the image using a 2D CNN; the second, a 3D CNN, extracts temporal information and motion. The outputs of the two streams are combined to obtain features that contain both spatial and temporal information about the video. Finally, a CNN layer makes predictions based on these extracted features.

    • Spatial Feature Extractor: the model needs a spatial feature extractor to provide location information. For this purpose, a YOLOv8 model is used with its detection layers removed. The input to this module is a feature map of size \([3,H,W]\), representing the final frame of the input video. Thanks to the pyramid network architecture, the output contains feature maps at three different levels: \(F_{lv1}:[C_{2D},\frac{H}{8},\frac{W}{8}]\), \(F_{lv2}:[C_{2D},\frac{H}{16},\frac{W}{16}]\), and \(F_{lv3}:[C_{2D},\frac{H}{32},\frac{W}{32}]\).
    • Decoupled head: the decoupled head separates the classification and regression tasks. The YOLOX team found that, in earlier models, using a single feature map for both classification and regression made training more challenging. A similar approach is therefore adopted: two independent CNN streams, one per task, to strengthen the model's comprehension.

    \[\begin{aligned}F_{cls}&=Conv_{cls2}(Conv_{cls1}(x))\\F_{box}&=Conv_{box2}(Conv_{box1}(x))\end{aligned}\]

    The output of the 2D backbone contains three feature maps at different levels. Each is fed into the decoupled head to produce two feature maps, one for classification and one for regression. The input to the decoupled head is a tensor \(F_{lv} : [C_{2D},H_{lv},W_{lv}]\); the output is two tensors of identical shape \([C_{inter},H_{lv},W_{lv}]\).
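
A minimal PyTorch sketch of such a decoupled head (layer sizes, kernels, and activations are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Two independent conv streams: one for classification, one for box regression."""
    def __init__(self, c_2d: int, c_inter: int):
        super().__init__()
        self.cls_branch = nn.Sequential(  # classification stream
            nn.Conv2d(c_2d, c_inter, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_inter, c_inter, 3, padding=1), nn.SiLU(),
        )
        self.box_branch = nn.Sequential(  # regression stream
            nn.Conv2d(c_2d, c_inter, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_inter, c_inter, 3, padding=1), nn.SiLU(),
        )

    def forward(self, f_lv: torch.Tensor):
        # f_lv: [B, C_2D, H_lv, W_lv] -> two tensors, each [B, C_inter, H_lv, W_lv]
        return self.cls_branch(f_lv), self.box_branch(f_lv)
```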

    • Temporal motion feature extractor: to improve the accuracy of action-label prediction, a 3D CNN is used, here an I3D model. The input to the 3D backbone is a tensor \(F_{3D}:[3, D, H, W]\) representing the entire video clip; the output is a tensor \(F_{3D}:[C_{3D},1,\frac{H}{32},\frac{W}{32}]\).

    • Fusion Head: the fusion head integrates the features from the 2D CNN and 3D CNN streams. Its input consists of two tensors, \(F_{lv}:[C_{inter},H_{lv},W_{lv}]\) and \(F_{3D} : [C_{3D},1,\frac{H}{32},\frac{W}{32}]\). First, \(F_{3D}\) is squeezed to shape \([C_{3D},\frac{H}{32},\frac{W}{32}]\) and then upscaled to match the dimensions \(H_{lv}\) and \(W_{lv}\). Next, \(F_{3D}\) and \(F_{lv}\) are concatenated into a tensor \(F_{concat}\), which is fed into the CFAM module, an attention mechanism. The output of CFAM is a feature map \(F_{final}:[C_{inter},H_{lv},W_{lv}]\).
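
A hedged sketch of this fusion step; only the squeeze/upscale/concatenate flow is described above, so the CFAM attention module is stood in for by a plain 1×1 convolution placeholder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Fuses one 2D pyramid level with the 3D clip feature."""
    def __init__(self, c_inter: int, c_3d: int):
        super().__init__()
        # Placeholder for the CFAM attention module described in the paper.
        self.cfam = nn.Conv2d(c_inter + c_3d, c_inter, kernel_size=1)

    def forward(self, f_lv: torch.Tensor, f_3d: torch.Tensor) -> torch.Tensor:
        # f_lv: [B, C_inter, H_lv, W_lv]; f_3d: [B, C_3D, 1, H/32, W/32]
        f_3d = f_3d.squeeze(2)                                            # drop the depth dim
        f_3d = F.interpolate(f_3d, size=f_lv.shape[-2:], mode="nearest")  # upscale to H_lv x W_lv
        f_concat = torch.cat([f_3d, f_lv], dim=1)                         # channel-wise concat
        return self.cfam(f_concat)                    # F_final: [B, C_inter, H_lv, W_lv]
```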

    • Detection Head: it has been pointed out that forcing the model to predict bboxes as a Dirac delta distribution makes training difficult, so the model instead learns a more general distribution rather than regressing to a single value. An anchor-free design is also used, to reduce the model's dependence on hyperparameter selection for predefined boxes as in previous studies. The input to the detection head consists of two tensors, \(F_{cls}\) and \(F_{box}\), for the classification and regression tasks respectively. The final predictions are obtained through a series of convolutions: \[\begin{aligned}Predict_{cls}&=Conv(Conv_{cls2}(Conv_{cls1}(F_{cls})))\\Predict_{box}&=Conv(Conv_{box2}(Conv_{box1}(F_{box})))\end{aligned}\]
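
The "general distribution instead of a single value" idea corresponds to the Distribution Focal Loss formulation from Generalized Focal Loss: each box side is predicted as a discrete distribution over bins and decoded as its expectation. A sketch under that assumption (the bin count `reg_max` is illustrative):

```python
import torch

def decode_box_distribution(box_logits: torch.Tensor, reg_max: int = 16) -> torch.Tensor:
    """Decode per-side distributions over discrete offsets into box distances.

    box_logits: [B, 4 * reg_max, H, W] -- logits over reg_max bins for each of
    the four sides (left/top/right/bottom).
    Returns:    [B, 4, H, W] -- the expected distance per side.
    """
    b, _, h, w = box_logits.shape
    probs = box_logits.view(b, 4, reg_max, h, w).softmax(dim=2)   # distribution per side
    bins = torch.arange(reg_max, dtype=probs.dtype, device=probs.device)
    return (probs * bins.view(1, 1, reg_max, 1, 1)).sum(dim=2)    # expectation over bins
```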

  5. Label Assignment: two different label assignment mechanisms are used to match the model's predictions with the ground-truth labels from the data: TAL and SimOTA (SimOTA is a simplified version of OTA). Both mechanisms rely on a similarity measure between \(d_{pred}\) and \(d_{truth}\) to perform the matching.

    • TAL: the similarity measure between a prediction \(d_{pred} \in \mathcal{A}\) and a ground truth \(d_{truth} \in \mathcal{T}\) is: \[\begin{aligned} metric &= cls\_err^{\alpha}\, box\_err^{\beta} \\ cls\_err &= BCE(cls_{pred},cls_{truth}) \\ box\_err &= CIoU(box_{pred},box_{truth}) \end{aligned}\]

    For each \(d_{truth} \in \mathcal{T}\), the \(top_k\) predictions \(d_{pred}\) with the highest \(metric\) are matched to it, but each \(d_{pred}\) can match at most one \(d_{truth}\). If a \(d_{pred}\) falls in the \(top_k\) of multiple \(d_{truth}\), it is matched to the \(d_{truth}\) with the highest \(CIoU(box_{pred},box_{truth})\). Additionally, only those \(d_{pred}\) are considered whose receptive-field center lies inside the box of \(d_{truth}\) and whose distance from that center to the center of the box of \(d_{truth}\) does not exceed a given radius.

    If \(d_{pred}\) is matched with \(d_{truth}\), then \(d_{truth}\) is taken as the target of \(d_{pred}\): the target probability for the corresponding class is set to 1 where that class appears, and 0 otherwise.
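
A toy sketch of this matching rule (not the repository's implementation); `metric` and `ciou` are assumed to be precomputed pairwise tables, with the center-inside-box and radius filters already applied:

```python
import torch

def tal_match(metric: torch.Tensor, ciou: torch.Tensor, top_k: int) -> torch.Tensor:
    """metric, ciou: [num_truth, num_pred] similarity tables.
    Returns a boolean mask [num_truth, num_pred] of matched (truth, pred) pairs."""
    mask = torch.zeros_like(metric, dtype=torch.bool)
    topk_idx = metric.topk(top_k, dim=1).indices        # top_k predictions per ground truth
    mask.scatter_(1, topk_idx, True)
    conflict = mask.sum(dim=0) > 1                      # preds claimed by several truths
    if conflict.any():
        # Among the claiming ground truths, keep the one with the highest CIoU.
        ciou_cand = torch.where(mask, ciou, torch.full_like(ciou, -1.0))
        best_truth = ciou_cand.argmax(dim=0)            # [num_pred]
        keep = torch.zeros_like(mask)
        keep[best_truth[conflict], conflict] = True
        mask[:, conflict] = keep[:, conflict]
    return mask
```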

    • SimOTA: similar to TAL, SimOTA uses a similarity measure for matching \(d_{pred}\) and \(d_{truth}\): \[\begin{aligned}metric&=BCE(\lambda\, cls_{pred},cls_{truth})-\alpha\log(\lambda)\\\lambda&=CIoU(box_{pred},box_{truth})\end{aligned}\]

    Unlike TAL, here each \(d_{truth}\) is matched only with the \(top_k\) predictions having the smallest \(metric\). SimOTA does not fix the value of \(top_k\); instead it estimates \(top_k\) for each \(d_{truth}\). Experiments show that this dynamic \(top_k\) slows down training and adds extra computational overhead.
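
For reference, the usual SimOTA recipe estimates each ground truth's \(top_k\) as the clamped sum of its top candidate IoUs; this heuristic is the standard SimOTA one and is an assumption here, not quoted from the paper:

```python
import torch

def simota_dynamic_k(ciou: torch.Tensor, candidate_k: int = 10) -> torch.Tensor:
    """Estimate a per-ground-truth top_k from the CIoU table [num_truth, num_pred]."""
    topk_ious, _ = ciou.topk(min(candidate_k, ciou.shape[1]), dim=1)
    return topk_ious.sum(dim=1).int().clamp(min=1)  # one dynamic k per ground truth
```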

  6. Loss Function: two loss functions correspond to the two label assignment mechanisms. The total loss consists of two parts: \[\mathcal{L}=\mathcal{L}_{box}+\mathcal{L}_{cls}\]

One part is for bbox regression, the other for label classification, and each contains multiple subcomponents. For unmatched predictions \(d_{pred} \in \mathcal{N}\), \(\mathcal{L}_{box}=0\) and \(cls_{truth}\) is set entirely to 0.

  • TAL: the loss function is: \[\mathcal{L}=\frac{\mathcal{L}_{box}+\mathcal{L}_{cls}}{\omega}\]

\[\begin{aligned}\mathcal{L}_{box}&=\delta(d_{truth})(\alpha\, CIoU(d_{pred},d_{truth})+\beta\mathcal{L}_{distribution})\\\mathcal{L}_{cls}&=\gamma\, BCE(cls_{pred},cls_{truth})\end{aligned}\]

\(\mathcal{L}_{distribution}\) denotes the Distribution Loss function.

\[\begin{aligned}\delta(d_i)&=\sum_{p_j\in cls_i}p_j\\\omega&=\sum_{d_{pred}\in\mathcal{A}}\sum_{d_i\in\mathcal{M}(d_{pred})}\delta(d_i)\end{aligned}\]

\(\alpha, \beta, \gamma\) are hyperparameters that scale the components of \(\mathcal{L}\); they are fixed at \(\alpha=7.5\), \(\beta=1.5\), \(\gamma=0.5\).
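
Assembled in code, a minimal sketch of the TAL loss above (all inputs are assumed to be precomputed tensors over the matched pairs):

```python
import torch

def tal_loss(ciou_loss: torch.Tensor, dist_loss: torch.Tensor, bce_cls: torch.Tensor,
             delta: torch.Tensor, omega: torch.Tensor,
             alpha: float = 7.5, beta: float = 1.5, gamma: float = 0.5) -> torch.Tensor:
    """L = (L_box + L_cls) / omega, with L_box weighted per target by delta."""
    l_box = delta * (alpha * ciou_loss + beta * dist_loss)  # per matched pair
    l_cls = gamma * bce_cls                                 # classification term
    return (l_box.sum() + l_cls.sum()) / omega
```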

  • SimOTA: the loss function is: \[\mathcal{L}=\frac{\mathcal{L}_{box}+\mathcal{L}_{cls}}{|\mathcal{P}|}\]

\[\begin{aligned}\mathcal{L}_{box}&=\alpha\, CIoU(d_{pred},d_{truth})+\beta\mathcal{L}_{distribution}\\\mathcal{L}_{cls}&=\exp(cls_{t})\,|cls_{truth}-cls_{pred}|^{\nu}\, BCE(cls_{pred},cls_{truth})\end{aligned}\]

\(\mathcal{L}_{cls}\) is a generalized focal loss, multiplied by a class-balancing factor \(\exp(cls_t)\):

\[p_i\in cls_t=\begin{cases}class\_ratio,&p_{truth}\neq0\\1-class\_ratio,&p_{truth}=0\end{cases}\]

\(class\_ratio\) is the class-balancing factor: classes with lower frequencies get a higher \(class\_ratio\). Here \(\alpha=5.5\), \(\beta=0.5\), \(\gamma=0.5\), \(\nu=0.5\).
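
A sketch of this class-balanced generalized focal term, assuming `class_ratio` is a precomputed per-class vector (higher for rarer classes):

```python
import torch
import torch.nn.functional as F

def balanced_focal_bce(cls_pred: torch.Tensor, cls_truth: torch.Tensor,
                       class_ratio: torch.Tensor, nu: float = 0.5) -> torch.Tensor:
    """cls_pred, cls_truth: [N, num_classes] probabilities / (soft) targets;
    class_ratio: [num_classes] per-class balancing factor."""
    cls_t = torch.where(cls_truth != 0, class_ratio, 1.0 - class_ratio)  # piecewise per class
    bce = F.binary_cross_entropy(cls_pred, cls_truth, reduction="none")
    return (torch.exp(cls_t) * (cls_truth - cls_pred).abs().pow(nu) * bce).sum()
```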

  7. Experiments: since AVAv2.2 is an extremely imbalanced dataset, additional measures are needed to mitigate the impact of this imbalance and improve the overall mAP. Two measures address it: soft labels (quality focal loss) and the inclusion of a class-balance term. For frequently occurring common classes, \(class\_ratio\) is close to 0.5 and the class-balance term \(\exp(cls_t)\) stays roughly constant, meaning the loss does not change much when a prediction is wrong; for rarely occurring classes it approaches 0, and \(\exp(cls_t)\) penalizes the model heavily when it fails to predict correctly. This creates a bias that helps improve predictions for the less common classes. In addition, soft labels are used to reduce the model's overconfidence, especially on very frequent classes.
    • For one-to-many label assignment, one ground-truth box is matched with multiple predicted boxes; the number of matched boxes can be estimated dynamically or predetermined. Experiments show that a fixed \(k\) already yields good results, while automatically estimating \(top_k\) adds roughly 6% extra training time. These results favor treating \(top_k\) as a hyperparameter rather than something to optimize.
    • A model variant with Exponential Moving Average (EMA) is retained to evaluate EMA's impact. The results show that EMA strongly affects model performance in the initial epochs and has a smaller effect in later epochs, indicating that EMA helps the model converge quickly in the early stages and improves the mAP score later in training (a generic EMA sketch follows this list).
  8. After looking at the YOWOv2 and YOWOv3 figures, they appear essentially the same, just drawn differently; it is unclear whether only the backbone was swapped. Check the code for details.
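
A generic EMA-of-weights sketch (the decay value and implementation details are assumptions; the repository may differ):

```python
import copy
import torch

class ModelEMA:
    """Keeps an exponential moving average of a model's weights."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.9998):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        msd = model.state_dict()
        for k, v in self.ema.state_dict().items():
            if v.dtype.is_floating_point:  # skip integer buffers, e.g. BN counters
                v.mul_(self.decay).add_(msd[k].detach(), alpha=1.0 - self.decay)
```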

Structure \(figure 1^{[1]}\): overview architecture of YOWOv3

CFAM \(figure 2^{[1]}\): overview of the CFAM module