TubeR

Posted 2024-08-24 · Updated 2024-11-08 · Category: Papers

TubeR: Tubelet Transformer for Video Action Detection^[1]

The authors are Jiaojiao Zhao, Yanyi Zhang, et al., from the University of Amsterdam, Rutgers University, and AWS AI Labs. Citation [1]: Zhao, Jiaojiao et al. “TubeR: Tubelet Transformer for Video Action Detection.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 13588-13597.

Time

April 2021

Key Words

learns a set of tubelet queries to pull action-specific tubelet-level features from a spatio-temporal video representation
spatial and temporal tubelet attention allows tubelets to be unrestricted in spatial location and scale over time
context aware classification head: along with the tubelet feature, it takes the full clip feature, from which the classification head can draw contextual information.
end-to-end without person detectors, anchors or proposals.

Summary

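Unlike existing methods that rely on an offline detector or on hand-designed actor-positional hypotheses such as proposals or anchors, TubeR directly detects action tubelets in a video by performing action localization and recognition simultaneously from a single representation. TubeR learns a set of tubelet queries and uses a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively reinforces model capacity compared with using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, the paper proposes a context aware classification head that uses short-term and long-term context to strengthen action classification, and an action switch regression head that detects the precise temporal extent of an action. TubeR directly produces action tubelets of variable length and maintains good results even on long video clips.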

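Action detection is a complex task: it requires localizing persons frame by frame, linking the detected person instances into action tubes, and predicting the action class. Two approaches dominate spatio-temporal action detection (STAD): frame-level and tubelet-level. Frame-level detection detects and classifies the action independently on each frame and then links the per-frame detections into coherent action tubes; to compensate for the missing temporal information, some methods simply repeat 2D proposals or offline person detections over time to obtain spatio-temporal features. Tubelet-level detection directly generates spatio-temporal volumes from a video clip to capture the coherent and dynamic nature of actions. It typically predicts action localization and classification jointly over spatio-temporal hypotheses such as 3D cuboid proposals. However, these 3D cuboids can only cover a short period of time, because the spatial location of a person changes as soon as the person or the camera moves. Ideally, such models would use flexible spatio-temporal tubelets that can track a person over a longer time, but the large configuration space of such a parameterization has restricted previous methods to short cuboids. The authors propose a tubelet-level detection approach that localizes and recognizes action tubelets simultaneously and flexibly, allowing tubelets to change size and location over time. This lets the system use longer tubelets and aggregate visual information about a person and their actions over a longer time.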

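The approach draws inspiration from sequence-to-sequence modeling in NLP, particularly machine translation, and from its application to object detection, DETR. As a detection framework, DETR can be applied frame by frame as a frame-level action detector. Here, instead, the decoder queries represent people and their actions over the whole video sequence, without restricting tubelets to fixed cuboids.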

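The paper proposes a tubelet transformer, called TubeR, for localizing and recognizing actions from a single representation. Building on the DETR framework, TubeR learns a set of tubelet queries to pull action-specific tubelet-level features from a spatio-temporal video representation. TubeR includes dedicated spatial and temporal tubelet attention, which leaves tubelets unrestricted in spatial location and scale over time and thereby overcomes the limitation of earlier cuboid-based methods. TubeR regresses the bboxes within a tubelet jointly across time, taking the temporal correlations between tubelets into account, and aggregates visual features over the tubelet to classify actions.

This core design already performs well, but it does not substantially outperform methods that use offline person detectors. The conjecture is that query-based features lack global context: looking at a single person only makes it hard to classify actions that involve relationships, such as listening-to or talking-to. The paper therefore proposes a context aware classification head that takes the full clip feature along with the tubelet feature, so that the classification head can draw on contextual information. This design lets the network relate a person tubelet to the full scene context in which the tubelet appears. Its limitation is that the context feature can only come from the same clip the tubelet occupies, whereas long-term contextual features matter for the final action classification. Inspired by prior work that compresses and stores contextual features of the video content around the tubelet, the paper introduces a memory system and feeds the long-term contextual memory to the classification head with the same feature injection strategy.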

The main contributions are:
1. TubeR, a tubelet-level transformer framework for human action detection.
2. A tubelet query and attention based formulation that can produce tubelets of arbitrary location and scale.
3. A context aware classification head that aggregates short-term and long-term contextual information.
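Related work:
- Frame-level action detection: these methods localize actors with 2D positional hypotheses (anchors) or an offline person detector on a keyframe, and then focus on improving action recognition. Some incorporate temporal patterns through an optical-flow stream; others use 3D ConvNets to capture temporal information for recognizing actions. Unlike frame-level methods, TubeR targets tubelet-level video action detection and performs localization and recognition with one unified configuration.
- Tubelet-level action detection: detecting actions with the tubelet as the representation unit has become popular. Some methods replicate 2D anchors per frame to pool ROI features and then stack the frame-wise features to predict the action class. Others rely on carefully designed 3D cuboid proposals: one line of work detects tubelets directly, another progressively refines 3D cuboid proposals across time. Besides box/cuboid anchors, some methods detect tubelet instances from center-position hypotheses. Hypothesis-based methods struggle with long video clips. TubeR instead learns a small set of tubelet queries to represent the dynamic nature of tubelets, reformulating action detection as a sequence-to-sequence learning problem and explicitly modeling the temporal correlations within a tubelet.
- Transformer-based action detection: Girdhar et al. proposed a video action transformer network for detecting actions. It uses a region proposal network for localization and improves action recognition by aggregating spatio-temporal context around the actors.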
TubeR: TubeR takes a video clip as input and directly outputs a tubelet, that is, a sequence of bboxes and the action label. TubeR is inspired by DETR but reformulates the transformer architecture for sequence-to-sequence modeling in video. Given a video clip $I \in \mathbb{R}^{T_{in} \times H \times W \times C}$, where $T_{in}, H, W, C$ denote the number of frames, height, width, and number of channels, TubeR first applies a 3D backbone to extract the video feature $F_{\mathrm{b}}\in\mathbb{R}^{T^{\prime}\times H^{\prime}\times W^{\prime}\times C^{\prime}}$, where $T'$ is the temporal dimension and $C'$ is the feature dimension. A transformer encoder-decoder then transforms the video feature into a set of tubelet-specific features $F_{\mathrm{tub}}\in\mathbb{R}^{N\times T_{\mathrm{out}}\times C^{\prime}}$, where $N$ is the number of tubelets. To process long video clips, temporal down-sampling is used so that $T_{out} < T' < T_{in}$, which reduces the memory requirement; in this case TubeR produces sparse tubelets. For short video clips the temporal down-sampling is removed so that $T_{out} = T' = T_{in}$, which results in dense tubelets. Tubelet regression and the associated action classification are performed simultaneously with separate task heads:

\[y_{\mathrm{coor}}=f(F_{\mathrm{tub}});y_{\mathrm{class}}=g(F_{\mathrm{tub}}),\]

$f$ denotes the tubelet regression head and $y_{\mathrm{coor}} \in \mathbb{R}^{N\times T_{\mathrm{out}}\times4}$ the coordinates of the $N$ tubelets, each of which spans $T_{out}$ frames (or $T_{out}$ sampled frames for long clips). $g$ denotes the action classification head and $y_{\mathrm{class}}\in\mathbb{R}^{N\times L}$ the action classification of the $N$ tubelets over $L$ possible labels.
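
As a quick shape check for the two heads, here is a minimal NumPy sketch. The sizes, the random weights `W_box` and `W_cls`, and the temporal mean pooling before classification are assumptions made for illustration; the notes only fix the output shapes of $y_{\mathrm{coor}}$ and $y_{\mathrm{class}}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N tubelets, T_out frames each, C' channels, L labels.
N, T_out, C, L = 5, 8, 16, 10

F_tub = rng.standard_normal((N, T_out, C))   # tubelet-specific features

# f: regression head, one box (4 numbers) per frame of every tubelet.
W_box = rng.standard_normal((C, 4))
y_coor = F_tub @ W_box                       # shape (N, T_out, 4)

# g: classification head, one score vector per tubelet.
# Assumption: features are averaged over time before the linear layer.
W_cls = rng.standard_normal((C, L))
y_class = F_tub.mean(axis=1) @ W_cls         # shape (N, L)

print(y_coor.shape, y_class.shape)
```

Each tubelet keeps one box per output frame but only one label vector, which is why the classification branch has to collapse the time axis somewhere.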
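- TubeR Encoder: unlike a vanilla transformer encoder, the TubeR encoder is designed to process information in the 3D spatio-temporal space. Each encoder layer consists of a self-attention (SA) layer, two normalization layers, and a feed-forward network (FFN). The core attention layers are: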
\[F_\mathrm{en}=\mathrm{Encoder}(F_\mathrm{b}),\] \[\mathrm{SA}(F_\mathrm{b})=\mathrm{softmax}(\frac{\sigma_q(F_\mathrm{b})\times\sigma_k(F_\mathrm{b})^T}{\sqrt{C^{\prime}}})\times\sigma_v(F_\mathrm{b}),\] \[\sigma(*)=\mathrm{Linear}(*)+\mathrm{Emb}_{\mathrm{pos}},\]

$F_b$ is the backbone feature and $F_{en} \in \mathbb{R}^{T'H'W' \times C'}$ is the $C'$-dimensional encoded feature embedding. $\sigma(*)$ is a linear transformation plus a positional embedding, and $\mathrm{Emb}_{\mathrm{pos}}$ is the 3D positional embedding. Optional temporal down-sampling can be applied to the backbone feature to shrink the input sequence length to the transformer for better memory efficiency.
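
The self-attention formula above can be sketched in NumPy as a single-head layer over flattened space-time tokens. This is only an illustration under assumptions: the sizes are made up, the weights are random, `emb_pos` is a random stand-in for the 3D positional embedding, and normalization layers and the FFN are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilize the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F_b, emb_pos, Wq, Wk, Wv):
    """Single-head SA over flattened space-time tokens.

    F_b, emb_pos: (T'*H'*W', C'); Wq, Wk, Wv: (C', C').
    sigma(x) = Linear(x) + Emb_pos, following the formula in the notes.
    """
    q = F_b @ Wq + emb_pos
    k = F_b @ Wk + emb_pos
    v = F_b @ Wv + emb_pos
    attn = softmax(q @ k.T / np.sqrt(F_b.shape[-1]))   # (tokens, tokens)
    return attn @ v, attn

rng = np.random.default_rng(0)
T, H, W, C = 4, 3, 3, 8                        # hypothetical T', H', W', C'
F_b = rng.standard_normal((T, H, W, C)).reshape(T * H * W, C)
emb_pos = rng.standard_normal((T * H * W, C))  # stand-in for the 3D positional embedding
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

F_sa, attn = self_attention(F_b, emb_pos, Wq, Wk, Wv)
print(F_sa.shape, attn.shape)
```

The attention matrix is of size $(T'H'W')^2$, which is why shrinking $T'$ by temporal down-sampling pays off directly in memory.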
- TubeR Decoder:
  - Tubelet query: detecting tubelets directly from anchor hypotheses is quite challenging, because the tubelet space along the spatio-temporal dimension is huge compared with the single-frame bbox space. Consider Faster R-CNN, which requires $K(=9)$ anchors for each position in a feature map of spatial size $H^{\prime}\times W^{\prime}$, that is, $KH'W'$ anchors in total. A tubelet across $T_{out}$ frames would require $(KH'W')^{T_{out}}$ anchors to maintain the same sampling in space-time. To reduce the tubelet space, some methods approximate tubelets with 3D cuboids by ignoring the spatial displacement of the action within a short video clip. However, the longer the video clip, the less accurately a 3D cuboid represents the tubelet. TubeR instead learns a small set of tubelet queries $Q{=}\{Q_{1},...,Q_{N}\}$, where $N$ is the number of queries. The $i$-th tubelet query $Q_{i}=\{q_{i,1},...,q_{i,T_{\mathrm{out}}}\}$ contains $T_{out}$ box query embeddings $q_{i,t} \in \mathbb{R}^{C'}$ across $T_{out}$ frames. A tubelet query is learned to represent the dynamics of a tubelet, instead of hand-designing 3D anchors. The box embeddings are initialized identically within a tubelet query.
  - Tubelet attention: to model the relations within the tubelet queries, the paper proposes a tubelet-attention (TA) module containing two self-attention layers. First, a spatial self-attention layer processes the spatial relations between box query embeddings within a frame. The intuition is that recognizing actions benefits from interactions between actors, or between actors and objects, in the same frame. Next, a temporal self-attention layer models the correlations across time between box query embeddings within the same tubelet. This layer encourages a TubeR query to track actors and produce action tubelets that focus on single actors rather than on a fixed area of the frame. The TubeR decoder applies the tubelet attention module to the tubelet queries $Q$ to produce the tubelet query feature $F_{q}\in\mathbb{R}^{N\times T_{\mathrm{out}}\times C^{\prime}}$:
    
    $F_q = \mathrm{TA}(Q)$
  - Decoder: the decoder contains a tubelet-attention module and a cross-attention (CA) layer, which decodes the tubelet-specific feature $F_{tub}$ from $F_{en}$ and $F_q$:
  \[\mathrm{CA}(F_{q},F_{\mathrm{en}})=\mathrm{softmax}(\frac{F_{q}\times\sigma_{k}(F_{\mathrm{en}})^{T}}{\sqrt{C^{\prime}}})\times\sigma_{v}(F_{\mathrm{en}}),\\F_{\mathrm{tub}}=\mathrm{Decoder}(F_{q},F_{\mathrm{en}}).\]
  
  $F_\mathrm{tub}\in\mathbb{R}^{N\times T_\mathrm{out}\times C^{\prime}}$ is the tubelet-specific feature. With temporal pooling, $T_{out} < T_{in}$ and TubeR produces sparse tubelets; with $T_{out} = T_{in}$, TubeR produces dense tubelets.
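
To make the decoder data flow concrete, the following NumPy sketch chains a spatial self-attention step (across the $N$ box queries of one frame), a temporal self-attention step (across the $T_{out}$ box queries of one tubelet), and a cross-attention step against the encoded feature. It rests on assumptions: learned projections, normalization, and the FFN are omitted, the residual connections are my own choice, and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention; works on 2D or batched 3D arrays."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def tubelet_attention(Q):
    """Q: (N, T_out, C'). Learned projections are omitted for brevity."""
    # Spatial step: for each frame t, the N box queries attend to each other.
    Qs = np.swapaxes(Q, 0, 1)                 # (T_out, N, C')
    Qs = Qs + attend(Qs, Qs, Qs)              # residual connection (assumption)
    Q = np.swapaxes(Qs, 0, 1)                 # back to (N, T_out, C')
    # Temporal step: for each tubelet i, its T_out box queries attend to each other.
    return Q + attend(Q, Q, Q)

def cross_attention(F_q, F_en):
    """F_q: (N, T_out, C'), F_en: (tokens, C') -> (N, T_out, C')."""
    N, T_out, C = F_q.shape
    out = attend(F_q.reshape(N * T_out, C), F_en, F_en)
    return out.reshape(N, T_out, C)

rng = np.random.default_rng(0)
N, T_out, C, tokens = 5, 4, 8, 36             # hypothetical sizes
# Box embeddings are initialized identically within each tubelet query.
Q = np.repeat(rng.standard_normal((N, 1, C)), T_out, axis=1)
F_en = rng.standard_normal((tokens, C))

F_q = tubelet_attention(Q)
F_tub = cross_attention(F_q, F_en)
print(F_q.shape, F_tub.shape)
```

Splitting the attention into a spatial and a temporal pass means each pass only builds $N \times N$ or $T_{out} \times T_{out}$ score matrices, instead of one $(N T_{out}) \times (N T_{out})$ matrix over all box queries.
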
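Task-Specific Heads: for each tubelet, bbox regression and action classification are handled by independent task-specific heads. This design keeps the computational overhead as low as possible.
- Context aware classification head: classification can be done with a simple linear projection. \[y_{\mathrm{class}}=\mathrm{Linear_c}(F_{\mathrm{tub}}),\]
$y_{class} \in \mathbb{R}^{N \times L}$ denotes the classification scores over the $L$ possible labels, one for each tubelet.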
1. Short-term context head: context is important for understanding sequences, so the paper further uses spatio-temporal video context to help understand the sequence. The action-specific feature $F_{tub}$ is used to query a context feature $F_{context}$ in order to strengthen $F_{tub}$, which gives the feature $F_{c}\in \mathbb{R}^{N \times C'}$ used for the final classification: \[F_\text{c}=\text{CA}(\text{Pool}_t(F_\text{tub}),\text{SA}(F_\text{context}))+\text{Pool}_t(F_\text{tub}). \tag{9}\]
Setting $F_{context} = F_b$ uses the short-term context in the backbone feature; this is called the short-term context head. A self-attention layer is first applied to $F_{context}$, then a cross-attention layer uses $F_{tub}$ to query $F_{context}$. $F_{c}$ is passed through a linear layer for the final classification.
2. Long-term context head: to use long-range temporal information under a limited memory budget, a two-stage decoder is adopted for long-term context compression.
\[\mathrm{Emb}_{\mathrm{long}}=\mathrm{Decoder}(\mathrm{Emb}_{n1},\mathrm{Decoder}(\mathrm{Emb}_{n0},F_{\mathrm{long}})).\]

The long-term context $F_{\mathrm{long}}\in\mathbb{R}^{T_{\mathrm{long}}\times H^{\prime}W^{\prime}\times C^{\prime}}$ is a buffer containing the backbone features extracted from $2w$ adjacent clips concatenated along time. To compress the long-term video feature buffer into an embedding $\mathrm{Emb}_{\mathrm{long}}$ with a lower temporal dimension, two stacked decoders with tokens $\mathrm{Emb}_{n0}$ and $\mathrm{Emb}_{n1}$ are used. First, a compression token $\mathrm{Emb}_{n0}$ ($n0 < T_{\mathrm{long}}$) queries the important information in $F_{\mathrm{long}}$, producing an intermediate compressed embedding of temporal dimension $n0$. Then another compression token $\mathrm{Emb}_{n1}$ ($n1 < n0$) queries the intermediate compressed embedding to obtain the final compressed embedding $\mathrm{Emb}_{\mathrm{long}}$. $\mathrm{Emb}_{\mathrm{long}}$ contains the long-term video information but has the lower temporal dimension $n1$. A cross-attention layer is then applied to $F_b$ and $\mathrm{Emb}_{\mathrm{long}}$ to obtain the long-term context feature $F_{\mathrm{lt}}\in\mathbb{R}^{T^{\prime}\times H^{\prime}\times W^{\prime}\times C^{\prime}}$:

\[F_{\mathrm{lt}}=\mathrm{CA}(F_{\mathrm{b}},\mathrm{Emb}_{\mathrm{long}}),\]

Setting $F_{context} = F_{lt}$ in Eq. 9 uses the long-term context for classification.
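
The short-term and long-term heads differ only in what is plugged in as $F_{context}$, which the NumPy sketch below tries to show. Several things here are assumptions: each "Decoder" is reduced to a single projection-free cross-attention, $\text{Pool}_t$ is taken to be a temporal mean, the compression tokens `emb_n0` and `emb_n1` are random rather than learned, space-time is flattened into one token axis, and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention; learned projections omitted."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def context_head(F_tub, F_context):
    """Eq. 9: F_c = CA(Pool_t(F_tub), SA(F_context)) + Pool_t(F_tub)."""
    pooled = F_tub.mean(axis=1)                     # Pool_t as a mean (assumption)
    ctx = attend(F_context, F_context, F_context)   # SA(F_context)
    return attend(pooled, ctx, ctx) + pooled        # (N, C')

def compress_long_term(F_long, emb_n0, emb_n1):
    """Two-stage compression: T_long tokens -> n0 tokens -> n1 tokens."""
    mid = attend(emb_n0, F_long, F_long)            # (n0, C')
    return attend(emb_n1, mid, mid)                 # (n1, C')

rng = np.random.default_rng(0)
N, T_out, C = 5, 4, 8                               # hypothetical sizes
tokens = 4 * 9                                      # T' * H'W'
long_tokens, n0, n1 = 24 * 9, 8, 2                  # T_long * H'W', with n1 < n0

F_tub = rng.standard_normal((N, T_out, C))
F_b = rng.standard_normal((tokens, C))
F_long = rng.standard_normal((long_tokens, C))
emb_n0 = rng.standard_normal((n0, C))
emb_n1 = rng.standard_normal((n1, C))

# Short-term head: F_context = F_b.
F_c_short = context_head(F_tub, F_b)

# Long-term head: compress the buffer, inject it into F_b, then reuse Eq. 9.
emb_long = compress_long_term(F_long, emb_n0, emb_n1)
F_lt = attend(F_b, emb_long, emb_long)              # CA(F_b, Emb_long)
F_c_long = context_head(F_tub, F_lt)

print(F_c_short.shape, emb_long.shape, F_lt.shape, F_c_long.shape)
```

Because $F_{\mathrm{lt}}$ has the same shape as $F_b$, the classification head needs no change when switching from short-term to long-term context.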

Action switch regression head: the $T_{out}$ bboxes in a tubelet are regressed simultaneously with one FC layer.

\[y_{\mathrm{coor}}=\mathrm{Linear}_{\mathrm{b}}(F_{\mathrm{tub}}),\]

$y_{\mathrm{coor}}\in\mathbb{R}^{N\times T_{\mathrm{out}}\times4}$, where $N$ is the number of action tubelets and $T_{out}$ is the temporal length of an action tubelet. To remove the non-action boxes in a tubelet, a further FC layer decides whether a box depicts the actor performing the action of the tubelet; this is called the action switch. The action switch allows TubeR to produce action tubelets with a more precise temporal extent. The switch probabilities for the $T_{out}$ predicted boxes in a tubelet are: \[y_\mathrm{switch}=\mathrm{Linear}_\mathrm{s}(F_\mathrm{tub}),\]

$y_\mathrm{switch}\in\mathbb{R}^{N\times T_\mathrm{out}}$: for each predicted tubelet, each of its $T_{out}$ bboxes receives an action switch score.
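
A small sketch of how the action switch could trim a tubelet at inference time. The sigmoid on the switch output and the 0.5 threshold are assumptions (the notes only say the switch is trained with a binary cross-entropy loss), and `trim_tubelet` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def trim_tubelet(boxes, switch_logits, threshold=0.5):
    """boxes: (T_out, 4); switch_logits: (T_out,).

    Keeps only the frames whose action switch is on.
    The threshold value is an assumption, not taken from the paper.
    """
    keep = sigmoid(switch_logits) > threshold
    return np.flatnonzero(keep), boxes[keep]

# One predicted tubelet over T_out = 6 frames (made-up numbers).
boxes = np.tile(np.array([0.1, 0.2, 0.5, 0.9]), (6, 1))
switch_logits = np.array([-3.0, -1.0, 2.0, 4.0, 1.5, -2.0])

frames, kept = trim_tubelet(boxes, switch_logits)
print(frames)       # [2 3 4]
print(kept.shape)   # (3, 4)
```

Here the action is predicted to start at frame 2 and end at frame 4, so the tubelet's temporal extent shrinks from six frames to three without changing the box regression itself.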

Losses: the total loss is a linear combination of four losses: \[\mathcal{L}=\lambda_{1}\mathcal{L}_{\mathrm{switch}}(y_{\mathrm{switch}},Y_{\mathrm{switch}})+\lambda_{2}\mathcal{L}_{\mathrm{class}}(y_{\mathrm{class}},Y_{\mathrm{class}})\\+\lambda_{3}\mathcal{L}_{\mathrm{box}}(y_{\mathrm{coor}},Y_{\mathrm{coor}})+\lambda_{4}\mathcal{L}_{\mathrm{iou}}(y_{\mathrm{coor}},Y_{\mathrm{coor}}),\]

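$y$ is the model output and $Y$ denotes the ground truth. The action switch loss $\mathcal{L}_{\mathrm{switch}}$ is a binary cross-entropy loss, $\mathcal{L}_{\mathrm{class}}$ is a cross-entropy loss, and $\mathcal{L}_{\mathrm{box}}$ and $\mathcal{L}_{\mathrm{iou}}$ measure the per-frame bbox matching error. When $T_{out} < T_{in}$ the tubelet is sparse, and the coordinate ground truth $Y_{coor}$ comes from the correspondingly temporally down-sampled frame sequence. Hungarian matching is used, and the weights are set empirically to $\lambda_{1}=1, \lambda_{2}=5, \lambda_{3}=2, \lambda_{4}=2$.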
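
The weighted sum can be illustrated for a single prediction that has already been matched to its ground-truth tubelet. This is a sketch under assumptions: Hungarian matching is skipped, the box loss is taken to be L1, the IoU term is a plain $1 - \mathrm{IoU}$ (the notes do not say which IoU variant is used), boxes are in `(x1, y1, x2, y2)` format, and all numbers are made up.

```python
import numpy as np

# Loss weights lambda_1..lambda_4 quoted in the notes.
LAMBDA = {"switch": 1.0, "class": 5.0, "box": 2.0, "iou": 2.0}

def bce(prob, target):
    prob = np.clip(prob, 1e-7, 1 - 1e-7)      # avoid log(0)
    return float(-(target * np.log(prob) + (1 - target) * np.log(1 - prob)).mean())

def cross_entropy(logits, label):
    z = logits - logits.max()                 # stable log-softmax
    return float(np.log(np.exp(z).sum()) - z[label])

def l1_box(pred, gt):
    return float(np.abs(pred - gt).sum(axis=-1).mean())

def iou_loss(pred, gt):
    """Mean (1 - IoU) over frames; boxes are (x1, y1, x2, y2)."""
    w = np.clip(np.minimum(pred[:, 2], gt[:, 2]) - np.maximum(pred[:, 0], gt[:, 0]), 0, None)
    h = np.clip(np.minimum(pred[:, 3], gt[:, 3]) - np.maximum(pred[:, 1], gt[:, 1]), 0, None)
    inter = w * h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return float((1 - inter / (area_p + area_g - inter)).mean())

def tuber_loss(switch_prob, switch_gt, logits, label, boxes, boxes_gt):
    terms = {
        "switch": bce(switch_prob, switch_gt),
        "class": cross_entropy(logits, label),
        "box": l1_box(boxes, boxes_gt),
        "iou": iou_loss(boxes, boxes_gt),
    }
    total = sum(LAMBDA[name] * value for name, value in terms.items())
    return total, terms

# One matched tubelet over T_out = 4 frames with perfectly regressed boxes.
boxes_gt = np.tile([0.1, 0.2, 0.5, 0.9], (4, 1))
total, terms = tuber_loss(
    switch_prob=np.array([0.1, 0.9, 0.8, 0.2]),
    switch_gt=np.array([0.0, 1.0, 1.0, 0.0]),
    logits=np.array([2.0, 0.5, -1.0]),
    label=0,
    boxes=boxes_gt.copy(),
    boxes_gt=boxes_gt,
)
print(terms, total)
```

With perfect boxes the two localization terms vanish, so the total reduces to the switch term plus five times the classification term.
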
Ablations:
- Benefit of tubelet queries: the experiments show the benefit of tubelet query sets. Each query set consists of $T_{out}$ per-frame query embeddings, each predicting the spatial location of the action on its own frame. This is compared against a single query embedding that represents a whole tubelet and must regress $T_{out}$ box locations for all frames in the clip. The query-set design performs better, which shows that modeling action detection as a sequence-to-sequence task uses the transformer architecture effectively.
- Effect of tubelet attention: compared with typical self-attention, tubelet attention saves memory.
- Benefit of the action switch: the action switch determines the temporal start and end of an action precisely. Without it, TubeR misclassifies transitional states as actions.
- Effect of the short- and long-term context heads: they bring a clear performance gain on the AVA dataset, since the network can see the full context of the clip.
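Limitations:
- The 3D backbone takes up a lot of memory and computation, which limits applying TubeR to long videos. Recent work uses transformer encoders for video embedding, which consume less memory.
- Processing a long video in one pass requires enough queries to cover the maximum number of different actions per person in the video. This can require a large number of queries, which causes memory issues in the self-attention layers. A possible solution is to generate person tubelets instead of action tubelets, so that tubelets need not be split when a new action happens; only one query is then needed per person instance.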
