EVAD
Efficient Video Action Detection with Token Dropout and Context Refinement[1]
The authors are Lei Chen, Zhan Tong, Yibing Song, and others, from Nanjing University (NJU), Ant Group, Fudan University, and Shanghai AI Lab. Citation [1]: Chen, Lei et al. “Efficient Video Action Detection with Token Dropout and Context Refinement.” 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023): 10354-10365.
Time
- 2023.Aug
Key Words
- spatiotemporal token dropout
- maintain all tokens in keyframe representing scene context
- select tokens from other frames representing actor motions
- drop out irrelevant tokens
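Summary
- Video clips produce a large number of tokens, which prevents ViTs from recognizing actions efficiently, especially in video action detection, where extensive spatiotemporal representations are needed to identify actors precisely. This work proposes an end-to-end framework for efficient video action detection (EVAD) based on vanilla ViTs. EVAD has two designs specific to video action detection. First, it performs spatiotemporal token dropout from a keyframe-centric perspective: within a video clip, it maintains all tokens from the keyframe and preserves only those tokens in other frames that are relevant to actor motions. Second, it refines the scene context using the remaining tokens to better recognize actor identities. The RoI in the action detector is extended into the temporal domain, and the resulting spatiotemporal actor identity representations are refined via scene context in a decoder with the attention mechanism. Together, these two designs make EVAD efficient while maintaining accuracy.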
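The keyframe-centric token dropout described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `keyframe_centric_dropout`, the `keep_ratio` parameter, and the per-token `scores` array are all hypothetical. In particular, the summary does not say how relevance to actor motion is measured, so the sketch simply assumes an importance score per token is already available.

```python
import numpy as np


def keyframe_centric_dropout(tokens, scores, keyframe_idx, keep_ratio):
    """Keep every keyframe token plus the top-scoring tokens of other frames.

    tokens: (T, N, D) array, T frames with N tokens of dimension D each.
    scores: (T, N) array of per-token importance (hypothetical input).
    keyframe_idx: index of the keyframe along the T axis.
    keep_ratio: fraction of non-keyframe tokens to keep, in [0, 1].
    Returns (kept_tokens, keep_mask) with shapes (K, D) and (T, N).
    """
    T, N, _ = tokens.shape
    keep_mask = np.zeros((T, N), dtype=bool)
    keep_mask[keyframe_idx] = True  # all keyframe tokens are maintained

    # Rank only non-keyframe tokens: keyframe scores are pushed to -inf
    # so they never compete for the remaining budget.
    ranked = scores.astype(float)  # astype returns a copy by default
    ranked[keyframe_idx] = -np.inf
    num_keep = int(round(keep_ratio * (T - 1) * N))
    if num_keep > 0:
        top = np.argsort(-ranked, axis=None, kind="stable")[:num_keep]
        t_idx, n_idx = np.unravel_index(top, (T, N))
        keep_mask[t_idx, n_idx] = True

    return tokens[keep_mask], keep_mask


# Toy clip: 4 frames, 6 tokens per frame, 8-dim features, keyframe at index 2.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 6, 8))
scores = rng.random((4, 6))
kept, mask = keyframe_centric_dropout(tokens, scores, keyframe_idx=2, keep_ratio=0.5)
print(kept.shape)  # (15, 8): 6 keyframe tokens + 9 of the 18 other tokens
```

The returned mask records which positions survive, so a later stage (such as the context-refinement decoder) can still tell which frame and location each remaining token came from.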