AVA Dataset
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions[1]
The authors are Chunhui Gu, Chen Sun, David A. Ross, et al., from Google Research, Inria Laboratoire Jean Kuntzmann (Grenoble, France), and UC Berkeley. Reference [1]: Gu, Chunhui et al. "AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018): 6047-6056.
Time
- May 2017
Key Words
- atomic visual actions rather than composite actions
- precise spatio-temporal annotations with possibly multiple annotations for each person
- exhaustive annotation of these atomic actions over 15-minute video clips
- people temporally linked across consecutive segments
Summary
The dataset is sourced from the 15th to 30th minute of 430 different movies; at a 1 Hz sampling frequency, this yields nearly 900 keyframes per movie. In each keyframe, every person is labeled with (possibly multiple) actions from the AVA vocabulary. Each person is linked across consecutive keyframes to provide short temporal sequences of action labels.
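A minimal sketch of this annotation layout, assuming a simple in-memory representation; the class and field names here are hypothetical and do not reflect the official AVA CSV format.

```python
from dataclasses import dataclass, field

@dataclass
class PersonAnnotation:
    person_id: int   # stable id linking the same person across consecutive keyframes
    bbox: tuple      # (x1, y1, x2, y2)
    actions: list = field(default_factory=list)  # possibly multiple AVA action labels

# Keyframes sampled at 1 Hz over the 15th-30th minute of a movie.
keyframe_times = range(15 * 60, 30 * 60)       # 900 keyframes per movie
annotations = {t: [] for t in keyframe_times}  # keyframe time -> list of PersonAnnotation
```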
Keyframes are annotated at a frequency of 1 Hz. This density is sufficient to capture the complete semantic content of actions, while avoiding unrealistically precise temporal annotation of action boundaries. As the THUMOS Challenge observed, action boundaries are inherently fuzzy, leading to severe inter-annotator disagreement. By contrast, AVA annotators can easily judge whether a frame (with \(\pm 1.5s\) of context) contains a given action, and AVA localizes action start and end points to an acceptable precision of \(\pm 0.5s\).
Person-centric action time series: the focus is on the activities of people, treated as single agents. There may be multiple people, as in sports, or two people hugging, but each person is an agent with individual choices, so each is treated separately.
Tubelets in video are the counterpart of bounding boxes in images. Methods for spatio-temporal action localization: earlier approaches rely on object detectors to discriminate action classes at the frame level.
- The authors build upon the idea of spatio-temporal tubes, but employ I3D features and Faster R-CNN region proposals to outperform prior methods.
The AVA dataset is annotated in five stages:
- action vocabulary generation
- movie and segment selection
- person bounding box annotation
- person link annotation
- action annotation
Each 15-min clip is then partitioned into 897 overlapping 3s movie segments with a stride of 1 second.
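As a sanity check on the arithmetic, a short sketch that enumerates 3s windows at a 1s stride over a 15-minute clip; the exact count depends on boundary handling (898 fully contained windows versus the paper's 897).

```python
CLIP_LEN = 15 * 60  # clip length in seconds
SEG_LEN = 3         # segment length in seconds
STRIDE = 1          # stride in seconds

# All (start, end) windows fully contained in the clip.
segments = [(t, t + SEG_LEN) for t in range(0, CLIP_LEN - SEG_LEN + 1, STRIDE)]
print(len(segments))  # 898; AVA reports 897, so one boundary segment is presumably dropped
```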
For bounding box annotation, a Faster R-CNN person detector first proposes boxes, and annotators then add the remaining boxes missed by the detector. Person link annotation works similarly: candidate links are first matched automatically using person embeddings with the Hungarian algorithm, and human annotators then remove false positives.
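A minimal sketch of the automatic linking step, assuming each detected person box comes with an embedding vector; SciPy's `linear_sum_assignment` implements the Hungarian algorithm. The `max_dist` threshold is a made-up stand-in for the human pass that removes false positives.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def link_people(emb_t, emb_t1, max_dist=0.5):
    """Match person boxes between consecutive keyframes by embedding distance.

    emb_t, emb_t1: (N, D) and (M, D) arrays of person embeddings.
    Returns (i, j) index pairs; pairs with distance above max_dist are
    dropped, mimicking the later removal of false-positive links.
    """
    cost = cdist(emb_t, emb_t1)               # pairwise embedding distances
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_dist]
```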
Each segment is annotated by 3 independent annotators, and an action label is regarded as ground truth only if it is verified by at least 2 of them. Annotators are shown segments in randomized order.
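The 2-of-3 verification rule reduces to a simple vote count; a sketch assuming each annotator contributes a set of action labels for a given person in a given segment.

```python
from collections import Counter

def consensus_labels(annotator_label_sets, min_votes=2):
    """Keep an action label only if at least min_votes of the
    independent annotators assigned it (2 of 3 in AVA)."""
    votes = Counter(label for labels in annotator_label_sets for label in set(labels))
    return {label for label, count in votes.items() if count >= min_votes}

# Example: three annotators label one person in one segment.
print(consensus_labels([{"stand", "talk"}, {"stand"}, {"stand", "listen"}]))
# -> {'stand'}
```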
The distribution of AVA action annotations roughly follows Zipf's law.
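One way to check a Zipf-like shape is a log-log fit of label frequency against rank, where Zipf's law predicts a slope near -1; the counts below are made-up placeholders, not actual AVA statistics.

```python
import numpy as np

# Hypothetical per-class annotation counts, sorted in descending order.
counts = np.array([90000, 41000, 27000, 20500, 16000, 13800, 11500, 10200])
ranks = np.arange(1, len(counts) + 1)

# Fit log(count) = a * log(rank) + b; Zipf's law predicts a ~ -1.
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"fitted slope: {slope:.2f}")
```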
The proposed action localization model:
- builds on spatio-temporal tubes
- operates on multi-frame temporal information
- uses I3D with Faster R-CNN: I3D models the temporal context, while Faster R-CNN supplies region proposals and end-to-end localization and classification of actions
\(Fig.1^{[1]}\). Illustration of our approach for spatio-temporal action localization. Region proposals are detected and regressed with Faster-RCNN on RGB keyframes. Spatio-temporal tubes are classified with two-stream I3D convolutions.
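A heavily simplified sketch of that design, with a single 3D convolution standing in for the two-stream I3D backbone and externally supplied keyframe proposals standing in for the Faster R-CNN RPN; `TubeClassifier` and all dimensions are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class TubeClassifier(nn.Module):
    """Toy stand-in for the paper's I3D + Faster R-CNN action head."""
    def __init__(self, num_classes=80, feat_dim=64):
        super().__init__()
        # Placeholder for an I3D backbone: one 3D conv over RGB frames.
        self.backbone = nn.Conv3d(3, feat_dim, kernel_size=3, padding=1)
        self.head = nn.Linear(feat_dim * 7 * 7, num_classes)

    def forward(self, clip, boxes):
        # clip: (N, 3, T, H, W); boxes: list of (K_i, 4) keyframe proposals
        # per clip, in feature-map coordinates (e.g. from an RPN).
        feats = self.backbone(clip).mean(dim=2)          # average over time -> (N, C, H, W)
        pooled = roi_align(feats, boxes, output_size=(7, 7))
        logits = self.head(pooled.flatten(1))
        return torch.sigmoid(logits)                     # multi-label action scores per box

model = TubeClassifier()
clip = torch.randn(1, 3, 8, 56, 56)               # one 8-frame RGB clip
boxes = [torch.tensor([[4.0, 4.0, 40.0, 50.0]])]  # one proposal on the keyframe
scores = model(clip, boxes)                       # (1, 80) action probabilities
```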