Video Understanding

Posted 2024-08-20, updated 2024-10-25. Category: Papers.

Computer vision tasks for video understanding and analysis

When I first read about this area, neither the papers nor the blog posts gave a clear and complete picture. Definitions were incomplete, and the English task names in particular were used inconsistently. Here is my understanding.
The main tasks:
- Action Recognition: essentially a video classification task, analogous to image classification.
- Temporal Action Localization: classification along the time axis; the model outputs the start time, end time, and class of each action.
- Spatio-Temporal Action Detection: besides identifying the temporal extent and class of each action, the model must also mark the position of the target in space with a bounding box.
- Spatio-temporal Action Localization: a name some authors use for the same task as the previous one.
- Action Detection, as defined on Paperswithcode: aims to find both where and when an action occurs within a video clip and classify what the action is taking place. Typically results are given in the form of action tubelets, which are action bounding boxes linked across time in the video. This is related to temporal localization, which seeks to identify the start and end frame of an action, and action recognition, which seeks only to classify which action is taking place and typically assumes a trimmed video.
- Temporal Action Segmentation, also mentioned in the papers: targets fine-grained actions and videos with dense occurrence of actions, predicting an action label at every frame of the video.
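Algorithms for spatio-temporal action detection: earlier papers all built on action recognition, and many follow the detection pipeline of the early SlowFast work, which needs an additional detector to perform action detection. In other words, spatio-temporal action detection is stacked on top of action recognition. This is not the approach I would consider ideal, but it explains why many action recognition algorithms also appear on the AVA leaderboard. Since reading VideoMAE I have kept following this line of work and have not looked at the alternatives.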
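Action Detection datasets:
- J-HMDB
- UCF101-24
- MultiSports
- AVA
- Of these, J-HMDB and UCF101-24 are densely annotated (every frame is labeled, at 25 fps). Each video in these datasets contains only one action, and most videos show a single person repeating a semantically simple action. AVA represents the sparsely annotated datasets (one frame is labeled per interval, at 1 fps), which give no explicit action boundaries.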

Deep Learning-based Action Detection in Untrimmed Videos: A Survey

The authors are Elahe Vahdani and Yingli Tian of the City University of New York. Citation [1]: Vahdani, Elahe and Yingli Tian. “Deep Learning-Based Action Detection in Untrimmed Videos: A Survey.” IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2021): 4302-4320.

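Many action recognition algorithms operate on trimmed videos, whereas most real-world videos are lengthy and untrimmed, with sparse segments of interest. Temporal activity detection in untrimmed videos localizes the temporal boundaries of actions and classifies their categories. Spatio-temporal action detection localizes the action in both the temporal and spatial dimensions and also recognizes its category. Because annotating long untrimmed videos is time-consuming and laborious, action detection with limited supervision is an important research direction.
Temporal action detection aims to find the precise temporal boundaries and labels of action instances in untrimmed videos. Depending on the availability of annotations in the training set, it can be divided into:
- Fully supervised action detection: the temporal boundaries and labels of action instances are available.
- Weakly supervised action detection: only the video-level labels of action instances are available; the order of the action labels may or may not be provided.
- Unsupervised action detection: no annotations for action instances.
- Semi-supervised action detection: the data is split into a small subset $S_1$ and a large subset $S_2$. Videos in $S_1$ are fully annotated (as in the fully supervised setting), while videos in $S_2$ are either unannotated or annotated only with video-level labels (as in the weakly supervised setting).
- Self-supervised action detection: a proxy task extracts information from the data, which is then used to improve performance, for example generic self-supervised pre-training followed by supervised fine-tuning.
- Temporal action detection, also called temporal action localization, follows the same idea as object detection in images and reuses similar components such as proposals and RoI pooling.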
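Untrimmed videos are usually long, and limited compute makes it hard to feed a whole video to the visual encoder for feature extraction. The common practice is to split the video into equal-sized temporal intervals called snippets and to apply the visual encoder to each snippet.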
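The snippet splitting described above can be sketched as follows. This is a minimal illustration, not the pipeline of any particular paper: the function names, the non-overlapping default stride, the choice to drop trailing frames that do not fill a snippet, and the toy encoder are all assumptions made for the example.

```python
def split_into_snippets(num_frames, snippet_len, stride=None):
    """Return (start, end) frame index pairs, end exclusive.

    Assumption: trailing frames that do not fill a whole snippet are dropped.
    """
    if stride is None:
        stride = snippet_len  # non-overlapping snippets by default
    last_start = num_frames - snippet_len
    return [(s, s + snippet_len) for s in range(0, last_start + 1, stride)]


def encode_video(frames, snippet_len, encoder):
    """Apply the encoder to each snippet independently."""
    spans = split_into_snippets(len(frames), snippet_len)
    return [encoder(frames[s:e]) for s, e in spans]


def toy_encoder(snippet):
    # Hypothetical stand-in for a visual encoder: averages per-frame scalars.
    return sum(snippet) / len(snippet)


# A "video" of 70 frames, each frame represented by a single number.
frames = list(range(70))
spans = split_into_snippets(len(frames), snippet_len=16)
features = encode_video(frames, snippet_len=16, encoder=toy_encoder)
```

With 70 frames and 16-frame snippets, four snippets are produced and the last six frames are left out; padding or an overlapping stride would be alternative choices.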
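Spatio-temporal action detection comes in frame-level and clip-level variants:
- Frame-level action detection: early approaches extended sliding-window methods and relied on strong assumptions, such as a cuboid shape, i.e. a fixed spatial extent of the actor across frames. Image object detection then inspired the detection of human actions at the frame level. In the first stage, action proposals are produced by region proposals or densely sampled anchors; in the second stage, the proposals are used for action classification and localization refinement. After detecting action regions in individual frames, some methods use optical flow to capture motion information and apply a linking algorithm to connect frame-level bounding boxes into spatio-temporal action tubes. Some use a dynamic programming approach to link the per-frame detections, with a cost function based on the detection scores of the boxes and the overlap between consecutive frames. Others replace the linking algorithm with tracking-by-detection. Another group of methods relies on an actionness measure, for example the pixel-wise probability of containing any action. To estimate actionness they use low-level cues such as optical flow, and they extract action tubes by thresholding the actionness scores, which yields only a rough localization of the action. The main drawback of these methods is that they do not fully exploit the temporal information in the video, because detection is performed independently on each frame. Effective temporal modeling matters because many actions are recognizable only when temporal context is available.
- Clip-level action detection: exploits temporal information by performing action detection at the clip level. Kalogeiton et al. proposed the action tubelet detector (ACT-detector), which takes a sequence of frames as input and outputs action classes and regressed tubelets, i.e. sequences of bounding boxes with associated scores. The tubelets are then linked to construct action tubes. Gu et al. further demonstrated the importance of temporal information by using longer clips and an I3D model pre-trained on a large-scale video dataset. To generate action proposals, 2D region proposals are extended to 3D under the assumption that the spatial extent is fixed within a clip. Action tubes with large spatial displacement over time violate this assumption, especially when the clip is long and involves fast movement of the actors or the camera.
- Modeling spatio-temporal dependencies: understanding human actions requires understanding the people and objects around the actor. Some methods use graph-structured networks and attention mechanisms to aggregate contextual information from the objects and people in the video. Spatio-temporal relations are learned through multiple layers of graph-structured self-attention, which can connect entities across consecutive clips and therefore capture long-range spatial and temporal dependencies.
- Metrics for spatio-temporal action detection: frame-AP measures the area under the precision-recall curve of the detections for each frame; a detection is correct if its IoU in that frame is greater than a threshold and the action label is correct. video-AP measures the area under the precision-recall curve of the action tube predictions; a tube is correct if its mean per-frame IoU over the whole video is greater than a threshold and the action label is predicted correctly.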
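To make the frame-AP definition concrete, here is a small sketch for a single action class. It assumes boxes in `(x1, y1, x2, y2)` format, greedy one-to-one matching in descending score order, and a plain (non-interpolated) sum of precision times recall increment as the area under the curve; benchmark toolkits may use interpolated variants, so treat this as illustrative only. All names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def frame_ap(detections, ground_truth, iou_thr=0.5):
    """Frame-level AP for one action class.

    detections:   list of (frame_id, score, box)
    ground_truth: dict frame_id -> list of boxes of this class
    """
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    matched = {f: [False] * len(boxes) for f, boxes in ground_truth.items()}
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    # Process detections from the most to the least confident.
    for frame_id, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(ground_truth.get(frame_id, [])):
            iou = box_iou(box, gt_box)
            if iou > best_iou and not matched[frame_id][j]:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou > iou_thr:
            matched[frame_id][best_j] = True  # each GT box is matched once
            tp += 1
        else:
            fp += 1
        recall, precision = tp / n_gt, tp / (tp + fp)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


gt = {0: [(0, 0, 10, 10)], 1: [(0, 0, 10, 10)]}
dets = [
    (0, 0.9, (0, 0, 10, 10)),    # exact match
    (1, 0.8, (20, 20, 30, 30)),  # no overlap: false positive
    (1, 0.7, (1, 1, 10, 10)),    # IoU 0.81: true positive
]
ap = frame_ap(dets, gt)
```

Frame-mAP is then the mean of this quantity over all action classes.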

A Survey on Deep Learning-based Spatio-temporal Action Detection

The authors are Peng Wang, Fanwei Zeng, and Yuntao Qian of Zhejiang University and Ant Group. Citation [2]: Wang, Peng et al. “A Survey on Deep Learning-based Spatio-temporal Action Detection.” ArXiv abs/2308.01618 (2023): n. pag.

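Spatio-temporal action detection (STAD) aims to classify the actions that appear in a video and to localize them in space and time. Traditional STAD approaches involve sliding windows, for example deformable part models and the branch-and-bound approach. Models fall into two main categories, frame-level and clip-level: frame-level models predict 2D bounding boxes for a frame, while clip-level models predict 3D spatio-temporal tubelets for a clip.
Frame-level: object detection has been very successful, so researchers generalized object detection models to STAD. The straightforward idea is to treat STAD in video as a collection of 2D image detections. Specifically, an action detector is applied to every frame to obtain frame-level 2D bounding boxes, and a linking or tracking algorithm then associates these frame-level detections to generate 3D action proposals. The authors review the related algorithms from four angles: temporal context, 3D CNNs, high efficiency and real-time speed, and visual relation modeling.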

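Some methods borrow from R-CNN and Faster R-CNN: they use an RPN, process RGB and optical flow in two separate branches, fuse the appearance and motion information, and link the detections into class-specific action tubes. There are also methods based on actionness maps, where actionness is the likelihood that a specific image location contains a generic action instance. These STAD methods treat frames independently and ignore temporal context. To overcome this, the cascade proposal and location anticipation (CPLA) model was proposed, which can infer the motion trend between two frames: it uses the bbox detected on frame $I_t$ to infer the corresponding bbox on frame $I_{t+k}$, where $k$ is the anticipation gap. Besides using optical flow to capture motion in video, a 3D CNN can extract motion information from multiple adjacent frames. Later work includes:
- the X3D network;
- ACDnet;
- embedding optical flow and RGB into a single-stream network, using optical flow to modulate the RGB features;
- using SSD as the detector;
- YOWO, inspired by YOLO: a 3D CNN extracts spatio-temporal information and a 2D model extracts spatial information;
- WOO: a single unified network that uses only one backbone for both actor localization and action classification;
- SE-STAD, which uses FCOS as the object detector;
- EVAD, which uses ViTs, reduces computation by dropping non-keyframe tokens, and refines scene context to improve model performance.

Frame-level methods do not fully exploit temporal information because they treat video frames as independent images. Clip-level STAD methods were proposed in response: they take a sequence of frames as input and directly output detected tubelet proposals (short sequences of bounding boxes).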
Clip-level: given a video clip, the model outputs 3D spatio-temporal tubelet proposals. A 3D tubelet proposal is formed by a sequence of bboxes that tightly bound the action of interest. The tubelet proposals from successive clips are then linked to form complete action tubes. The authors review the related algorithms from four angles: large motion, progressive learning, anchor-free detection, and visual relation modeling. To overcome the fixed spatial extent of 3D anchors, two-frame micro-tubes were proposed. To avoid 3D cuboid anchors, another approach performs frame-level actor detection, links the detected bboxes into class-independent action tubelets, and feeds them to a temporal understanding module for action classification. There are also sparse-to-dense methods. Progressive learning methods repeatedly refine the proposals towards the actions over a few steps. The PCSC framework works iteratively, using the region proposals and features from one stream (RGB/Flow) to help the other stream (RGB/Flow) improve its action localization results. Because computing anchors is costly, several anchor-free methods have been proposed; one treats each action instance as a trajectory of moving points.
- MovingCenter Detector (MOC-detector): consists of three branches: a center branch for instance center detection and action recognition, a movement branch for movement estimation, and a box branch for spatial extent detection.
- VideoCapsuleNet: uses 3D convolutions together with capsules to learn the semantic information needed for action detection and recognition. Its localization component uses the action representation captured by the capsules for pixel-wise localization of actions.
- TubeR: directly detects action tubelets in a video, performing action localization and recognition simultaneously from a single representation. It designs a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, and it learns a set of tubelet queries from which it outputs action tubelets.

For visual relation modeling, clip-level visual relations have also been explored to strengthen STAD models. The long short-term relation network (LSTR) captures both short-term and long-term relations in videos. Specifically, LSTR first produces 3D bboxes (tubelets) in each video. It then models human-context interactions within each clip through a spatio-temporal attention mechanism, and reasons about long-term temporal dynamics across video clips with a graph ConvNet in a cascaded manner. The features of the actor tubelets and object proposals are then used to build a relation graph that models human-object manipulations and human-human interactions.

Linking up the Detection Results: actions last for some time, usually spanning many frames and clips. Once frame-level or clip-level detections are available, many methods use a linking algorithm to connect the detections across frames or clips into video-level action tubes.
- Linking up frame-level detection boxes: the first frame-level action detection linking algorithm was proposed by Gkioxari. It assumes that two adjacent region proposals (bboxes) are very likely to be linked when their spatial extents overlap well and their scores are high. The linking score of two region proposals is computed as: \[s_c(R_t,R_{t+1})=s_c(R_t)+s_c(R_{t+1})+\lambda\cdot ov(R_t,R_{t+1}) \tag{1}\]
$s_c(R_i)$ is the class-specific score of region proposal $R_i$, and $ov(R_i,R_j)$ is the IoU (overlap) of $R_i$ and $R_j$. $\lambda$ is a hyperparameter that weights the IoU term. Some models output bboxes with actionness scores, in which case the actionness scores replace the class-specific scores. Once all linking scores have been computed, the optimal path is found by: \[\bar R_c^*=\underset{\bar R}{\text{argmax}}\frac{1}{T}\sum_{t=1}^{T-1}s_c(R_t,R_{t+1}) \tag{2}\]

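$\bar{R}_{c} = [R_{1},R_{2},\ldots,R_{T}]$ is the sequence of linked regions for action class $c$. This optimization problem is solved with the Viterbi algorithm. After the optimal path is found, the region proposals in $\bar{R}_{c}$ are removed from the set of region proposals, and the problem is solved again until the set of region proposals is empty. Each path obtained from Eq. (2) is called an action tube. The score of an action tube $\bar{R}_{c}$ is defined as $S_{c}(\bar{R}_{c})=\frac1T\sum_{t=1}^{T-1}s_{c}(R_{t},R_{t+1})$.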
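The following sketch shows how Eq. (1) and Eq. (2) could be solved with a Viterbi-style dynamic program for one class. It finds a single best path only; the repeated remove-and-resolve loop described above would wrap around it. The data layout (a list of `(box, score)` pairs per frame), the `(x1, y1, x2, y2)` box format, the requirement that every frame has at least one detection, and all function names are assumptions for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_score(det_a, det_b, lam=1.0):
    """Eq. (1): s(R_t) + s(R_{t+1}) + lambda * ov(R_t, R_{t+1})."""
    (box_a, s_a), (box_b, s_b) = det_a, det_b
    return s_a + s_b + lam * box_iou(box_a, box_b)


def best_tube(frames, lam=1.0):
    """Eq. (2): return (indices of the chosen detection per frame, tube score).

    frames: list over time; each entry is a non-empty list of (box, score).
    """
    T = len(frames)
    best = [[0.0] * len(frames[0])]  # best[t][j]: best sum of link scores ending at j
    back = []                        # back[t][j]: best predecessor in frame t for j in t+1
    for t in range(1, T):
        cur_best, cur_back = [], []
        for det in frames[t]:
            cands = [best[-1][i] + link_score(prev, det, lam)
                     for i, prev in enumerate(frames[t - 1])]
            i_star = max(range(len(cands)), key=cands.__getitem__)
            cur_best.append(cands[i_star])
            cur_back.append(i_star)
        best.append(cur_best)
        back.append(cur_back)
    # Backtrack from the best final detection.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for t in range(T - 2, -1, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return path, total / T


frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)],
    [((1, 0, 11, 10), 0.8), ((50, 50, 60, 60), 0.3)],
    [((2, 0, 12, 10), 0.7)],
]
path, score = best_tube(frames)
```

In this toy input the path follows the slowly moving box in the top-left corner, because both its scores and its frame-to-frame overlap are high.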
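- Building on Gkioxari's idea, Peng et al. add a threshold function to Eq. (1), so the linking score between two region proposals becomes: \[s_{c}(R_{t},R_{t+1})=s_{c}(R_{t})+s_{c}(R_{t+1})+\lambda\cdot ov(R_{t},R_{t+1})\cdot\psi(ov)\]
$\psi(ov)$ is a threshold function: $\psi(ov)=1$ when $ov$ is greater than $\tau$, and $\psi(ov)=0$ otherwise. Peng found experimentally that the threshold function makes the linking score better and more robust. Kopuklu further extended the definition of the linking score: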

\[\begin{aligned} s_{c}\left(R_{t},R_{t+1}\right)=& \psi(ov)\cdot[s_{c}\left(R_{t}\right)+s_{c}\left(R_{t+1}\right) \\ &+\alpha\cdot s_{c}\left(R_{t}\right)\cdot s_{c}\left(R_{t+1}\right) \\ &+\beta\cdot ov\left(R_{t},R_{t+1}\right)] , \end{aligned}\]

where $\alpha$ and $\beta$ are hyperparameters. The term $\alpha \cdot s_c(R_t)\cdot s_c(R_{t+1})$ takes dramatic changes between two consecutive frames into account and improves video detection performance.
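The three linking scores can be compared side by side with a few lines of code. The functions below transcribe the formulas directly; the default values of `lam`, `tau`, `alpha`, and `beta` are placeholders chosen for the example, not values taken from the papers.

```python
def psi(ov, tau):
    """Threshold function: 1 if the overlap exceeds tau, else 0."""
    return 1.0 if ov > tau else 0.0


def link_score_gkioxari(s_t, s_next, ov, lam=1.0):
    # Eq. (1): the overlap term always contributes.
    return s_t + s_next + lam * ov


def link_score_peng(s_t, s_next, ov, lam=1.0, tau=0.2):
    # The overlap term contributes only when ov > tau.
    return s_t + s_next + lam * ov * psi(ov, tau)


def link_score_kopuklu(s_t, s_next, ov, alpha=0.5, beta=1.0, tau=0.2):
    # The whole score is gated by psi, and a product term is added.
    return psi(ov, tau) * (s_t + s_next + alpha * s_t * s_next + beta * ov)


# Two confident detections with little overlap (ov below tau).
low = [f(0.9, 0.8, 0.1) for f in
       (link_score_gkioxari, link_score_peng, link_score_kopuklu)]
# The same detections with substantial overlap.
high = [f(0.9, 0.8, 0.5) for f in
        (link_score_gkioxari, link_score_peng, link_score_kopuklu)]
```

For the low-overlap pair the three variants give roughly 1.8, 1.7, and 0: the last formulation refuses to link boxes that barely overlap, however confident the individual detections are.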
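- Temporal trimming: the linking algorithms above produce action tubes that span the whole video duration, whereas human actions usually occupy only a small part of it. Several temporal trimming methods determine the temporal extent of an action instance. Saha constrains consecutive proposals to have smooth actionness scores and solves the resulting energy maximization problem with dynamic programming. Peng et al. rely on an efficient maximum subarray method: given a video-level action tube $\bar{R}$, its ideal temporal extent runs from frame $s$ to frame $e$, which satisfy: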
\[s_{c}(\bar{R}_{(s,e)}^{\star})=\underset{(s,e)}{\operatorname*{argmax}}\{\frac{1}{L_{(s,e)}}\sum_{i=s}^{e}s_{c}(R_{i})-\lambda\frac{|L_{(s,e)}-L_{c}|}{L_{c}}\},\]

$L_{(s,e)}$ is the length of the action tube, and $L_c$ is the average duration of class $c$ in the training set.
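The trimming objective can be illustrated with an exhaustive search over all `(s, e)` pairs. This is deliberately not the efficient maximum subarray method mentioned above; it is a quadratic-time sketch that evaluates the same objective, assuming $L_{(s,e)} = e - s + 1$ with inclusive frame indices. The function name and the toy scores are hypothetical.

```python
def trim_tube(scores, mean_len, lam=1.0):
    """Search the (s, e) span maximizing
    mean(scores[s..e]) - lam * |L - mean_len| / mean_len, with L = e - s + 1.

    scores:   per-frame class scores s_c(R_i) along one linked tube
    mean_len: average duration L_c of the class in the training set
    """
    T = len(scores)
    prefix = [0.0]
    for value in scores:
        prefix.append(prefix[-1] + value)  # prefix sums for O(1) span means
    best_val, best_span = float("-inf"), None
    for s in range(T):
        for e in range(s, T):
            length = e - s + 1
            mean_score = (prefix[e + 1] - prefix[s]) / length
            val = mean_score - lam * abs(length - mean_len) / mean_len
            if val > best_val:
                best_val, best_span = val, (s, e)
    return best_span, best_val


# The action is clearly visible only in frames 2..4.
scores = [0.1, 0.1, 0.9, 0.8, 0.9, 0.1]
span, value = trim_tube(scores, mean_len=3)
```

The length penalty keeps the search from collapsing onto a single high-scoring frame: the span `(2, 2)` has the highest mean score, but it is much shorter than the expected duration.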
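- Online action tube generation: at the first frame of the video, the $n$ detected bboxes initialize $n$ action tubes for each class $c$. An action tube then either grows by adding a box from the new frame or is terminated when no box has matched for $k$ consecutive frames. Finally, each updated tube is temporally trimmed by binary labeling with an online Viterbi algorithm.
- Linking up clip-level detection results: clip-level tubelet linking algorithms associate a sequence of clip-level tubelets into video-level action tubes. They are usually derived from frame-level box linking algorithms. The content of a tubelet should capture an action, and the linked tubelets in any two consecutive clips should have a large temporal overlap. The linking score of tubelets is therefore defined as: \[S=\frac{1}{m}\sum_{i=1}^{m}Actionness_{i}+\frac{1}{m-1}\sum_{j=1}^{m-1}Overlap_{j,j+1}\]
$Actionness_i$ is the actionness score of the tubelet from the $i$-th clip. $Overlap_{j,j+1}$ is the overlap between the two proposals from the $j$-th and $(j+1)$-th clips. $m$ is the total number of video clips. The overlap between two tubelets is computed from the last frame of the $j$-th tubelet and the first frame of the $(j+1)$-th tubelet. Once the tubelet scores are computed, the tubelets are linked according to them. In addition, others extend the frame linking algorithm to tubelet linking in order to build action tubes. The core idea is:
- Initialization: at the first frame of the video, start a new link for each tubelet. Here a link is a sequence of linked tubelets.
- Linking: given a new frame $f$, extend each existing link with one of the tubelet candidates starting at this frame. A tubelet candidate is selected if it has not been selected by another link, has the highest action score, and overlaps the link being extended by more than a given threshold.
- Termination: an existing link is terminated if these criteria are not met for $K$ consecutive frames, where $K$ is a given hyperparameter.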

Thanks to its simplicity and efficiency, the tubelet linking algorithm has been adopted by many recent works.
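The initialization, linking, and termination steps of tubelet linking can be sketched as a greedy loop. Several details are assumptions made for this example and are not specified in the text: a tubelet is a dict with `boxes` and `score`, the overlap between a link and a candidate is the IoU between the link's last box and the candidate's first box, links whose latest tubelet scores higher choose first, and candidates left unselected do not start new links.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_tubelets(tubelets_by_frame, iou_thr=0.5, max_misses=2):
    """tubelets_by_frame[f]: tubelets starting at frame f, each a dict
    {"boxes": [box, ...], "score": float}. Returns a list of links."""
    # Initialization: one link per tubelet of the first frame.
    active = [{"tubelets": [t], "misses": 0} for t in tubelets_by_frame[0]]
    finished = []
    for candidates in tubelets_by_frame[1:]:
        taken = set()
        # Assumption: links with a higher-scoring latest tubelet pick first.
        active.sort(key=lambda link: -link["tubelets"][-1]["score"])
        still_active = []
        for link in active:
            last_box = link["tubelets"][-1]["boxes"][-1]
            eligible = [i for i, cand in enumerate(candidates)
                        if i not in taken
                        and box_iou(last_box, cand["boxes"][0]) > iou_thr]
            if eligible:
                # Linking: take the free candidate with the highest score.
                best = max(eligible, key=lambda i: candidates[i]["score"])
                taken.add(best)
                link["tubelets"].append(candidates[best])
                link["misses"] = 0
            else:
                link["misses"] += 1
            # Termination: no match for max_misses consecutive frames.
            if link["misses"] >= max_misses:
                finished.append(link)
            else:
                still_active.append(link)
        active = still_active
    return finished + active


a0 = {"boxes": [(0, 0, 10, 10), (1, 0, 11, 10)], "score": 0.9}
b0 = {"boxes": [(50, 50, 60, 60), (50, 50, 60, 60)], "score": 0.6}
a1 = {"boxes": [(1, 0, 11, 10), (2, 0, 12, 10)], "score": 0.8}
links = link_tubelets([[a0, b0], [a1], [], []])
link_lengths = sorted(len(link["tubelets"]) for link in links)
```

Here the link started by `a0` is extended with `a1`, while the link started by `b0` finds no overlapping candidate and is terminated after two misses.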

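Temporal trimming: in the tubelet linking algorithm, initialization and termination determine the temporal extent of the action tubes. It has been observed that this cannot fully resolve the temporal localization errors caused by transitional states, which are defined as ambiguous states that do not belong to the target actions. To address this, a transition-aware classifier was proposed that can distinguish transitional states from real actions. Later work introduced an action switch regression head, which decides whether a box prediction depicts an actor performing the action. The head outputs an action switch score for every bbox of a tubelet; if the score is higher than a given threshold, the box is considered to contain the action. This head effectively reduces the misclassification of transitional states.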
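Datasets: the datasets commonly used in STAD are:
- Weizmann: recorded with a static camera against a uniform background. It contains 90 video clips grouped into 10 action classes, performed by 9 different subjects. Each video clip contains multiple instances of a single action. The spatial resolution is $180 $, and each clip lasts 1 to 5 seconds.
- CMU Crowded Videos: contains 5 actions, each with 5 training videos and 48 test videos. All videos are rescaled to a spatial resolution of $120 \times 160$, and the test videos last 5 to 37 seconds (166 to 1115 frames). The dataset was recorded in cluttered, dynamic environments, which makes action detection on it more challenging. It is densely annotated with spatial and temporal coordinates (x, y, height, width, start and end frames).
- MSR Action I and II: created by Microsoft Research; II is an extension of I. Action I contains 62 action instances in 16 video sequences. Action II contains 203 instances in 54 videos, and each video contains multiple actions performed by different individuals. All videos last 32 to 76 seconds, and spatial and temporal annotations are provided for every action instance. There are 3 action categories.
- J-HMDB: the joint-annotated HMDB dataset. HMDB contains 6849 video clips spread over 51 action categories, each with at least 101 clips. J-HMDB contains 21 classes of videos selected from HMDB, restricted to videos showing the action of a single person. Each action class has 36 to 55 clips and each clip has 15 to 40 frames, for a total of 928 clips. Every clip is trimmed so that the first and last frames correspond to the start and end of an action. The frame resolution is $320 \times 240$ and the frame rate is 30 fps.
- UCF Sports: contains 10 actions from the sports domain. All videos contain camera motion and complex backgrounds. There are 150 clips, with a frame rate of 10 fps, a spatial resolution between $480 \times 360$ and $720 \times 576$, and a duration of 2.2 to 14.4 seconds (6.39 seconds on average).
- UCF101-24: UCF101 is collected from Youtube and contains 101 action categories and 13320 videos in total. For action detection, a subset of 3207 videos covering 24 action categories is densely annotated; this subset is called UCF101-24. Unlike UCF Sports and J-HMDB, whose videos are trimmed, the UCF101-24 videos are untrimmed.
- THUMOS and MultiTHUMOS: the THUMOS series comprises four datasets: THUMOS13, THUMOS14, THUMOS15, and MultiTHUMOS. All videos come from UCF101. The THUMOS datasets contain 24 action classes, and video length ranges from a few seconds to a few minutes. They contain 13000 trimmed videos, over 1000 untrimmed videos, and over 2500 negative sample videos. A video may contain no instance, one instance, or multiple instances of one or several actions. MultiTHUMOS is an enhanced version of THUMOS: a dense, multi-class, frame-wise labeled video dataset with 400 videos totaling 30 hours and 38690 annotations of 65 classes. On average there are 1.5 labels per frame and 10.5 action classes per video.
- AVA: built from 430 movies on Youtube. For each movie, the clip from the 15th to the 30th minute is provided, and each clip is split into 897 overlapping 3-second segments with a stride of 1 second. The middle frame of each segment is chosen as the keyframe, and every person in each keyframe is annotated with a bbox and actions. The 430 movies are split into 235 training, 64 validation, and 131 test movies, roughly a 55:15:30 ratio. The dataset contains 80 atomic actions, of which 60 are used for evaluation.
- MultiSports: the videos come from Olympic and World Cup competitions on Youtube. The dataset covers 4 sports and 66 action categories, with 800 clips per sport and 3200 clips in total. It contains 37701 action instances with 902k bboxes. The number of instances per action category ranges from 3 to 3477, a natural long-tailed distribution. Each video is annotated with multiple instances of multiple action categories. The average video length is 750 frames, while action segments are short, 24 frames on average.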
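Evaluation metrics: there are two main ones, frame-mAP and video-mAP.
- Frame-mAP: the area under the PR curve of the bbox detections at each frame. A detection is correct if its IoU with the GT bbox is greater than a given threshold and the action label is correct; the threshold is set to 0.5. Frame-mAP allows detection accuracy to be compared independently of the linking strategy.
- Video-mAP: the area under the PR curve of the action tube predictions. A tube detection is correct if its IoU with the GT tube is greater than a given threshold and the action label is correct. The IoU between two tubes is defined as the temporal IoU multiplied by the average of the IoU between boxes, averaged over all overlapping frames. The video-mAP threshold is usually set to 0.2, 0.5, 0.75, and 0.5:0.95, where the last one is the average video-mAP for thresholds in this range with a step of 0.05. Whereas frame-mAP measures classification and spatial detection ability in a single frame, video-mAP additionally evaluates temporal detection ability.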
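The tube IoU used by video-mAP can be written down in a few lines. In this sketch a tube is represented as a dict mapping frame index to a box in `(x1, y1, x2, y2)` format, and the temporal IoU is computed on the sets of annotated frames, which equals the interval IoU when both tubes are contiguous. The representation, the function names, and the threshold used in the example are assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tube_iou(tube_a, tube_b):
    """Temporal IoU times the mean box IoU over the overlapping frames.

    tube_a, tube_b: dict frame_index -> box
    """
    common = tube_a.keys() & tube_b.keys()
    if not common:
        return 0.0
    temporal = len(common) / len(tube_a.keys() | tube_b.keys())
    spatial = sum(box_iou(tube_a[f], tube_b[f]) for f in common) / len(common)
    return temporal * spatial


def tube_is_correct(pred, pred_label, gt, gt_label, thr=0.5):
    """A predicted tube counts as correct if label and tube IoU both pass."""
    return pred_label == gt_label and tube_iou(pred, gt) > thr


gt_tube = {f: (0, 0, 10, 10) for f in range(0, 10)}     # frames 0..9
pred_tube = {f: (0, 0, 10, 5) for f in range(5, 15)}    # frames 5..14
iou = tube_iou(pred_tube, gt_tube)
```

In the example the tubes share 5 of 15 frames (temporal IoU 1/3) and the boxes overlap by half on those frames, so the tube IoU is 1/6, which fails even the loosest common threshold of 0.2.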
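Future directions:
- Label-efficient learning for STAD: STAD needs dense annotations, which are expensive.
- Online real-time STAD: STAD has many online applications, where the prediction for the current frame must be based on past data only. This requires lightweight and efficient models, and there is still a long way to go.
- STAD under large motion: in real-world scenes, many actions exhibit large motion because of fast actor displacement or camera motion.
- Multimodal learning for STAD: action videos contain multiple modalities, including vision, sound, and even language, so multimodal learning has the potential to achieve better detection accuracy than a single modality. Actions can also be captured by various sensors, such as depth cameras, infrared cameras, and Lidar, and STAD may benefit from fused representations learned from such multimodal data.
- Diffusion models for STAD: diffusion models are a class of generative models that start from a sample of a random distribution and recover data samples through gradual denoising. Although they are generative models, they have proven effective on representative perception tasks such as object detection and temporal action localization: given random spatial boxes (temporal proposals) as input, diffusion-based models can accurately produce object boxes (action proposals). Since STAD can be viewed as a combination of object detection and temporal action localization, some works have explored using diffusion models to solve the STAD task.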