Holistic Interaction Transformer
Holistic Interaction Transformer Network for Action Detection[1]
The authors are Gueter Josmy Faure, Min-Hung Chen and Shang-Hong Lai from National Tsing Hua University and Microsoft AI. Citation [1]: Faure, Gueter Josmy et al. "Holistic Interaction Transformer Network for Action Detection." 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2022): 3329-3339.
Time
- 2022.Nov.18
Key Words
- bi-modal structure
- combine different interactions
Summary
- Actions are about how we interact with our environment, including other people, objects, and ourselves. The authors propose a new multi-modal Holistic Interaction Transformer Network (HIT) that exploits hand and pose information, which is largely overlooked yet important for human actions. HIT is a comprehensive bi-modal framework consisting of an RGB stream and a pose stream. Each stream independently models person, object, and hand interactions, and within each sub-network an Intra-Modality Aggregation module (IMA) selectively fuses the individual interactions. The features obtained from each modality are then fused with an Attentive Fusion Mechanism (AFM). Finally, cues are extracted from the temporal context via a cached memory to better classify the occurring actions.
A sound spatio-temporal action detection framework aims to correctly label every person in a frame. It should also keep a link between neighboring frames to better understand activities with a continuous nature, such as "open" and "close". More recent, robust works consider the relationships between spatial entities, since two people appearing in the same frame are likely to interact with each other. However, person features alone are not sufficient to capture object-related actions. Others try to understand not only the relations between the people in a frame but also the objects around them. These methods have two main drawbacks: (1) they rely only on objects with high detection confidence, which may lead to ignoring important objects that went undetected; (2) these models struggle to detect object-related actions when the object does not appear in the frame. For example, consider the action "point to (an object)": the object the actor points to may not be in the current frame.
The HIT network uses fine-grained context, including person pose, hands, and objects, to build a bi-modal interaction structure. Each modality comprises three main components: person interaction, object interaction, and hand interaction, and each component learns valuable local action patterns. Before learning temporal information from neighboring frames, an Attentive Fusion Mechanism combines the information from the different modalities to help better detect the actions taking place in the current frame. The main contributions are:
- A new framework is proposed that combines RGB, pose, and hand features for action detection.
- A bi-modal HIT network is introduced that combines different interactions in an intuitive and meaningful way.
- An Attentive Fusion Module is proposed as a selective filter that keeps the most informative features of each modality, and an Intra-Modality Aggregator is used to learn useful action representations within each modality.
Related work:
Spatio-temporal action detection, unlike classifying an entire video into a single class, requires localizing actions both spatially and temporally. Many recent works on spatio-temporal action detection use a 3D CNN backbone to extract video features and then use ROI pooling or ROI align to crop person features from the video features; doing so discards potentially useful information in the rest of the video.
The spatio-temporal action detection task is, in essence, an interaction modeling task, since most actions are interactions with the environment. Many studies use attention mechanisms. Some proposed the Temporal Relation Network (TRN) to learn inter-frame dependencies, i.e., interactions between entities from neighboring frames. Other methods go further and model not only temporal but also spatial interactions between different entities from the same frame. However, which entities are chosen for interaction modeling varies from model to model: besides human features, some also use background information to model interactions between the people in the frame and the context. Cropping persons' features without discarding the remaining background features provides rich information, but may also introduce a lot of noise. Others try to be more selective about the features to use: they first pass the video frames through an object detector, crop both the object and person features, and model their interactions. This additional layer of interaction provides better representations than models that only model human interactions, and helps with object-related classes such as "work on a computer". However, these methods fall short when the object is too small to be detected or does not appear in the frame.
Many recent action detection frameworks use only RGB features, while others use optical flow to capture motion. One approach uses an Inception-like model that concatenates RGB and flow features at the Mixed4b layer, whereas others use the I3D network to obtain RGB and flow features separately and concatenate the two modalities before the action classifier. The authors' bi-modal method instead uses visual and skeleton-based features; each modality computes a series of interactions, including person, object, and hands, before being fused. A temporal interaction module is then applied to the fused features to learn global information from neighboring frames.
Methods: HIT consists of an RGB sub-network and a pose sub-network, each of which learns a person's interactions with their surroundings by focusing on the key entities that drive most of our actions. After fusing the outputs of the two sub-networks, the model further looks at cached features from past and future frames to model how actions evolve in time. Such a holistic understanding of activities helps achieve better action detection performance. The pipeline has several steps: entity selection, the RGB modality, the pose modality, the Attentive Fusion Module (AFM), and the Temporal Interaction Module.
- Entity selection: HIT consists of two mirroring modalities with distinct modules that learn different types of interactions. Human actions are largely driven by pose, hand movements, and interactions with the surroundings. Based on these observations, human poses and hand bounding boxes are selected as the model's entities, along with object and person bounding boxes. Detectron is used for human pose detection, and a bounding box is then created around the location of the person's hands. Faster R-CNN is used to compute object bounding box proposals. The video feature extractor is a 3D CNN backbone, and the pose encoder is a lightweight spatial transformer; ROIAlign is used to trim the video features and extract person, hand, and object features, as sketched below.
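The following is a hypothetical sketch of that entity-feature-extraction step using `torchvision.ops.roi_align`. The temporal mean-pooling, feature shapes, and scale are illustrative assumptions, not the authors' exact implementation.

```python
import torch
from torchvision.ops import roi_align

def extract_entity_features(video_feats, person_boxes, object_boxes, hand_boxes,
                            spatial_scale=1.0 / 16, output_size=7):
    """video_feats: (B, C, T, H, W) from the 3D CNN backbone.
    *_boxes: lists of (N_i, 4) box tensors in image coordinates, one per clip."""
    # Collapse the temporal axis so ROIAlign can operate on a 2D feature map
    # (an assumption; other temporal reductions are possible).
    feats_2d = video_feats.mean(dim=2)                        # (B, C, H, W)

    def crop(boxes):
        # roi_align returns (sum_i N_i, C, output_size, output_size);
        # average-pool the spatial grid into one vector per box.
        pooled = roi_align(feats_2d, boxes, output_size, spatial_scale)
        return pooled.flatten(2).mean(dim=2)                  # (sum_i N_i, C)

    return crop(person_boxes), crop(object_boxes), crop(hand_boxes)
```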
- RGB branch: the RGB branch contains three components, each consisting of a series of operations that learn specific information about the target person. The object and hand interaction modules model person-object and person-hand interactions, while the person interaction module learns interactions among the persons in the current frame. At the heart of each interaction unit is a cross-attention computation: the query is the target person (or the output of the previous unit), and the key and value are derived from the object or hand features, depending on which module we are at. It is like asking "how can these particular features help detect what the target person is doing?" The formulas are: \[F_{rgb}=(A(\mathcal{P})\to z_{r}\to A(\mathcal{O})\to z_{r}\to A(\mathcal{H})\to z_{r})\\A(*)=softmax(\frac{w_q(\widetilde{P})\times w_k(*)}{\sqrt{d_r}})\times w_v(*)\\z_{r}=\sum_{b}A(b)\times softmax(\theta_{b}),\quad b\in(\widetilde{P},\mathcal{O},\mathcal{H},\mathcal{M})\]
\(d_r\) denotes the channel dimension of the RGB features, and \(w_q\), \(w_k\), \(w_v\) project their inputs into query, key, and value. \(A(*)\) is the cross-attention mechanism; it takes only person features as input when computing the person interaction \(A(\mathcal{P})\). For the hand interaction (respectively, the object interaction) there are only two sets of inputs: the output of \(z_r\), which serves as the query (\(\widetilde{P}\)), and the hand features (respectively, object features), from which the keys and values are obtained.
\(z_r\) is a weighted sum over all interaction modules, including the temporal interaction module \(TI\). \(z_r\) matters for two reasons: first, it allows the network to aggregate as much information as possible; second, the learnable parameter \(\theta\) helps filter the different sets of features, hand-picking the best each of them has to offer while discarding noisy and unimportant information. A minimal sketch of one interaction unit and this aggregation follows.
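The sketch below implements the cross-attention unit \(A(*)\) and the intra-modality aggregation \(z_r\) described by the equations above. The single attention head, feature dimensions, and class names are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionUnit(nn.Module):
    """Cross-attention A(*): the target person acts as query, while the
    entity features (persons, objects, or hands) provide keys and values."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, person, entities):
        q = self.w_q(person)                            # (N_p, d)
        k, v = self.w_k(entities), self.w_v(entities)   # (N_e, d)
        attn = F.softmax(q @ k.t() * self.scale, dim=-1)
        return attn @ v                                 # (N_p, d)

class IntraModalityAggregator(nn.Module):
    """z_r: weighted sum of the interaction outputs, where the learnable
    weights theta (softmax-normalized) filter out uninformative branches."""
    def __init__(self, num_branches):
        super().__init__()
        self.theta = nn.Parameter(torch.zeros(num_branches))

    def forward(self, branch_outputs):                  # list of (N_p, d)
        weights = F.softmax(self.theta, dim=0)
        return sum(w * b for w, b in zip(weights, branch_outputs))
```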
- Pose branch: the pose model is similar to its RGB counterpart and reuses most of its outputs. Pose features \(\mathcal{K}^{\prime}\) are first extracted by a lightweight transformer encoder \(f\): \[\mathcal{K}^{\prime}=f(\mathcal{K})\]
\(F_{pose}\) is then computed by mirroring the different constituents of the RGB modality and reusing their corresponding outputs, where \(\mathcal{P}^{\prime}\), \(\mathcal{O}^{\prime}\), \(\mathcal{H}^{\prime}\) are the outputs of \(A(\mathcal{P}), A(\mathcal{O}), A(\mathcal{H})\), respectively. \[F_{pose}=(A(\mathcal{K}^{\prime},\mathcal{P}^{\prime})\to z_{p}\to A(\mathcal{O}^{\prime})\to z_{p}\to A(\mathcal{H}^{\prime})\to z_{p})\\A(\mathcal{K}^{\prime},\mathcal{P}^{\prime})=softmax(\frac{w_{q}(\mathcal{K}^{\prime})\times w_{k}(\mathcal{P}^{\prime})}{\sqrt{d_{p}}})\times w_{v}(\mathcal{P}^{\prime})\]
\(A(\mathcal{K}^{\prime},\mathcal{P}^{\prime})\) computes cross-attention between the pose features \(\mathcal{K}^{\prime}\) and the enhanced person interaction features \(\mathcal{P}^{\prime}\); this cross-modal blend reinforces the pose features by focusing on the key corresponding attributes of the RGB features. The other components, \(A(\mathcal{O}^{\prime})\) and \(A(\mathcal{H}^{\prime})\), apply a linear projection to \(z_p\) as the query, while their key-value pairs stem from \(A(\mathcal{O})\) and \(A(\mathcal{H})\). \(z_p\) is the intra-modality aggregation component of the pose model; similar to \(z_r\), it filters and aggregates the information from each interaction module.
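Reusing the `InteractionUnit` sketched in the RGB-branch example above, the snippet below illustrates \(A(\mathcal{K}^{\prime},\mathcal{P}^{\prime})\): the pose features act as the query while the output of \(A(\mathcal{P})\) supplies the keys and values. Shapes and variable names are placeholder assumptions.

```python
import torch

pose_unit = InteractionUnit(dim=256)          # same unit type as the RGB branch
k_prime = torch.randn(3, 256)                 # encoded pose features for 3 persons
p_prime = torch.randn(3, 256)                 # output of the RGB person unit A(P)
fused_pose = pose_unit(k_prime, p_prime)      # A(K', P') from the equation above
```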
Attentive Fusion Module (AFM): before entering the action classifier, the RGB and pose streams need to be combined into a single set of features. For this purpose, the Attentive Fusion Module is proposed: a channel-wise concatenation of the two feature sets followed by self-attention for feature refinement. A projection matrix \(\Theta_{fused}\) is then used to reduce the size of the output features. \[F_{fused}=\Theta_{fused}(SelfAttention(F_{rgb},F_{pose}))\]
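A hedged sketch of the AFM under these definitions: channel-wise concatenation, self-attention for refinement, then the projection \(\Theta_{fused}\). The use of a single-head `nn.MultiheadAttention` and the output dimension are assumptions.

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    def __init__(self, dim_rgb, dim_pose, dim_out, num_heads=1):
        super().__init__()
        dim_cat = dim_rgb + dim_pose
        self.attn = nn.MultiheadAttention(dim_cat, num_heads, batch_first=True)
        self.theta_fused = nn.Linear(dim_cat, dim_out)   # projection Theta_fused

    def forward(self, f_rgb, f_pose):
        # f_rgb: (N, d_r), f_pose: (N, d_p) -> concatenate along channels
        x = torch.cat([f_rgb, f_pose], dim=-1).unsqueeze(0)   # (1, N, d_r + d_p)
        x, _ = self.attn(x, x, x)                             # self-attention refinement
        return self.theta_fused(x.squeeze(0))                 # F_fused: (N, dim_out)
```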
Temporal Interaction Unit: after the fusion module comes a temporal interaction block \(TI\). Human actions unfold continuously, so long-term context is important for understanding them. Along with \(F_{fused}\), this module receives compressed memory data \(\mathcal{M}\) of length \(2S+1\); the memory cache contains person features obtained from the video backbone. \(F_{fused}\) queries \(\mathcal{M}\) as to which of the neighboring frames contain informative features, and then absorbs them. \(TI\) is another cross-attention module where \(F_{fused}\) is the query and two different projections of the memory \(\mathcal{M}\) form the key-value pairs. \[F_{cls}=TI(F_{fused},\mathcal{M})\] Finally, the classification head \(g\) consists of two feed-forward layers with ReLU activations and an output layer: \[\hat{y}=g(F_{cls})\]
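The sketch below illustrates the temporal interaction unit \(TI\) and the head \(g\) as just described: \(F_{fused}\) cross-attends to the cached memory \(\mathcal{M}\), and two feed-forward layers with ReLU plus an output layer produce the logits. Dimensions and the use of `nn.MultiheadAttention` are assumptions.

```python
import torch.nn as nn

class TemporalInteraction(nn.Module):
    def __init__(self, dim, num_classes, num_heads=1):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(      # head g: two FF layers + output layer
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, f_fused, memory):
        # f_fused: (N, d) fused current-frame features; memory: (L, d) cached
        # person features from the 2S+1 neighboring frames.
        q = f_fused.unsqueeze(0)                    # (1, N, d)
        kv = memory.unsqueeze(0)                    # (1, L, d)
        f_cls, _ = self.cross_attn(q, kv, kv)       # F_cls = TI(F_fused, M)
        return self.classifier(f_cls.squeeze(0))    # y_hat = g(F_cls)
```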
Experiments:
Person and object detectors: keyframes are extracted from each video in the dataset, and the detected person bounding boxes from YOWO [16] are used for inference. As the object detector, a Faster R-CNN with a ResNet-50-FPN backbone is used; the model is pre-trained on ImageNet and fine-tuned on MSCOCO.
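A minimal sketch of obtaining object box proposals for a keyframe with torchvision's off-the-shelf Faster R-CNN (ResNet-50-FPN), assuming a recent torchvision version; the score threshold is an arbitrary assumption and the authors' detector is fine-tuned separately.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_objects(keyframe, score_thresh=0.5):
    """keyframe: (3, H, W) float tensor with values in [0, 1]."""
    out = detector([keyframe])[0]
    keep = out["scores"] > score_thresh        # keep confident detections only
    return out["boxes"][keep], out["labels"][keep], out["scores"][keep]
```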
Keypoint detection and processing: for keypoint detection, the pose model from Detectron is adopted, using a ResNet-50-FPN pre-trained on ImageNet for object detection and fine-tuned on MSCOCO keypoints with precomputed RPN proposals. Each keyframe of the target dataset is passed through the model, which outputs 17 keypoints per detected person in the COCO format. The detected pose coordinates are further post-processed so that they match the ground-truth person bounding boxes (during training) and the bounding boxes from [16] (during inference). For the location of a person's hands, only the wrist keypoints are of interest, so a bounding box is built around these keypoints to highlight the person's hands and everything in between, as sketched below.
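A hedged sketch of deriving such a hand box from the wrist keypoints of a COCO-format pose (17 keypoints; indices 9 and 10 are the left and right wrists). Spanning both wrists plus a fixed margin is an assumption about "everything in between", not the authors' exact rule.

```python
import numpy as np

LEFT_WRIST, RIGHT_WRIST = 9, 10   # wrist indices in the COCO keypoint format

def hands_bbox(keypoints, margin=20):
    """keypoints: (17, 3) array of (x, y, score) for one detected person."""
    wrists = keypoints[[LEFT_WRIST, RIGHT_WRIST], :2]
    x1, y1 = wrists.min(axis=0) - margin      # top-left corner with margin
    x2, y2 = wrists.max(axis=0) + margin      # bottom-right corner with margin
    return np.array([x1, y1, x2, y2])
```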
The SlowFast network is used as the video backbone.
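For reference, a minimal sketch of loading a SlowFast model via the PyTorchVideo hub entry point; the paper does not specify this code path, and the clip shapes below are the common defaults rather than the authors' configuration.

```python
import torch

backbone = torch.hub.load("facebookresearch/pytorchvideo",
                          "slowfast_r50", pretrained=True).eval()

# SlowFast consumes a [slow_pathway, fast_pathway] pair of clips,
# e.g. slow: (B, 3, 8, 256, 256) and fast: (B, 3, 32, 256, 256).
clip = [torch.randn(1, 3, 8, 256, 256), torch.randn(1, 3, 32, 256, 256)]
with torch.no_grad():
    out = backbone(clip)   # hub model returns classification output; HIT would
                           # instead tap an intermediate feature map (assumption)
```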
Limitations: the framework relies on offline detectors and pose estimators, so their accuracy may affect the method. Typical failure cases are similar-looking classes, such as "throw" and "catch", which look very alike, and partial occlusion.
\(Figure \ 1^{[1]}\): Overview of our HIT Network. On top of our RGB stream is a 3D CNN backbone which we use to extract video features. Our pose encoder is a spatial transformer model. We parallelly compute rich local information from both sub-networks using person, hands, and object features. We then combine the learned features using an attentive fusion module before modeling their interaction with the global context
\(Figure \ 2^{[1]}\): Illustration of the Interaction module. ∗ refers to the module-specific inputs while Pe refers to the person features in \(A(P)\) or the output of the module that comes before \(A(∗)\)
\(Figure \ 3^{[1]}\): Illustration of the Intra-Modality Aggregator. Features from one unit to the next are first augmented with contextual cues then filtered