SlowFast
SlowFast Networks for Video Recognition[1]
The authors are Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He from FAIR. Citation [1]: Feichtenhofer, Christoph et al. “SlowFast Networks for Video Recognition.” 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2018): 6201-6210.
Time
- 2018.Dec
Key Words
- Slow pathway to capture spatial semantics
- lightweight Fast pathway to capture temporal motion and fine temporal resolution
Motivation
- Not all spatiotemporal orientations are equally likely, so there is no reason to treat space and time symmetrically.
- Inspired by biological studies on the retinal ganglion cells in the primate visual system: about 80% of these cells are Parvocellular (P-cells) and about 20% are Magnocellular (M-cells).
  - M-cells operate at high temporal frequency \(\rightarrow\) responsive to fast temporal changes
  - P-cells detect spatial information: fine spatial detail and color, at lower temporal resolution
Summary
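- Optical flow is a hand-designed representation, and two-stream methods cannot be learned end-to-end jointly with the flow.
- \(\alpha\) is the frame rate ratio between the Fast and Slow pathways; \(\alpha > 1\) is the key concept of SlowFast. The Fast pathway has a ratio of \(\beta < 1\) of the Slow pathway's channels, so its computation cost is lower.
- Lateral connections fuse the information of the two pathways. Because the two pathways have different temporal dimensions, a transformation is needed to match them. The Fast pathway has no temporal downsampling layers and uses non-degenerate temporal convolutions.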
- AVA Detection:
  - RoI features are extracted from the last feature map of res5. The 2D RoI at a frame is extended to a 3D RoI by replicating it along the temporal axis. RoI features are then computed with RoIAlign spatially and global average pooling temporally. The RoI features are max-pooled and fed to a per-class, sigmoid-based classifier for multi-label prediction.
  - The authors use an off-the-shelf detector: a person detector trained with Detectron, namely Faster R-CNN with a ResNeXt-101-FPN backbone. It is pre-trained on ImageNet and the COCO human keypoint images, then fine-tuned on AVA for person detection. The region proposals for action detection are the detected person boxes with a confidence of > 0.8.
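The AVA detection steps above can be sketched in plain Python. This is only an illustration of the data flow, not the paper's implementation: the function names, the box format `(x1, y1, x2, y2)`, and the toy inputs are all assumptions made for this note, and RoIAlign itself is left out.

```python
import math

def filter_person_boxes(detections, threshold=0.8):
    """Keep person boxes whose detector confidence is > threshold.

    `detections` is a list of (box, score) pairs; this layout is hypothetical.
    """
    return [box for box, score in detections if score > threshold]

def extend_roi_to_3d(box, num_frames):
    """Replicate a 2D RoI along the temporal axis: one (t, x1, y1, x2, y2) per frame."""
    return [(t, *box) for t in range(num_frames)]

def temporal_average(features_per_frame):
    """Global average pooling over time for a list of per-frame feature vectors."""
    n = len(features_per_frame)
    return [sum(column) / n for column in zip(*features_per_frame)]

def multilabel_scores(logits):
    """Per-class sigmoid: every class gets an independent probability."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Toy inputs, made up purely for illustration.
detections = [((10, 20, 50, 80), 0.95), ((0, 0, 5, 5), 0.42)]
proposals = filter_person_boxes(detections)
rois_3d = extend_roi_to_3d(proposals[0], num_frames=4)
pooled = temporal_average([[1.0, 2.0], [3.0, 4.0]])
```

Because the classifier is sigmoid-based rather than softmax-based, one person box can receive several action labels at once, which is what multi-label prediction on AVA requires.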
\(Fig.1^{[1]}\) A SlowFast network has a low frame rate, low temporal resolution Slow pathway and a high frame rate, α× higher temporal resolution Fast pathway. The Fast pathway is lightweight by using a fraction (β, e.g., 1/8) of channels. Lateral connections fuse them.
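To make the roles of α and β concrete, here is a minimal sketch of how the two pathways could sample frames from one raw clip and how the Fast pathway's channel width follows from β. The function names and the example values (4 Slow frames, Slow stride 16, α = 8, 64 Slow channels) are assumptions chosen for illustration; only β = 1/8 appears in the caption above. The last line shows one simple way to match temporal dimensions for a lateral connection, namely keeping every α-th Fast frame, which is an assumption of this sketch and not the only possible transformation.

```python
def pathway_frame_indices(num_slow_frames, slow_stride, alpha):
    """Return (slow_indices, fast_indices) into the raw clip.

    The Fast pathway samples alpha times more densely than the Slow pathway.
    """
    if slow_stride % alpha != 0:
        raise ValueError("slow_stride must be divisible by alpha in this sketch")
    clip_len = num_slow_frames * slow_stride
    fast_stride = slow_stride // alpha
    slow = list(range(0, clip_len, slow_stride))
    fast = list(range(0, clip_len, fast_stride))
    return slow, fast

def fast_channels(slow_channels, beta):
    """Channel width of the Fast pathway as a fraction beta of the Slow pathway."""
    return int(slow_channels * beta)

slow, fast = pathway_frame_indices(num_slow_frames=4, slow_stride=16, alpha=8)
print(len(slow), len(fast))        # 4 32
print(fast_channels(64, 1 / 8))    # 8

# Time-strided sampling: keep every alpha-th Fast frame so the temporal
# length matches the Slow pathway before fusing.
fast_matched = fast[::8]
```

With these values the Fast pathway sees 8 times as many frames but carries only 1/8 of the channels, which is why it stays lightweight.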
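Action recognition with SlowFast
I ran into a lot of pitfalls, but I finally got results, or at least some feedback: the model does recognize actions, although many problems remain.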
Bugs
- libstdc++.so.6: version `GLIBCXX_3.4.20' not found
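Following the Stack Overflow answers (answer1 and answer2): the environment already contained a libstdc++ that provides GLIBCXX_3.4.20, so in the end export LD_LIBRARY_PATH=/path/to/lib:$LD_LIBRARY_PATH solved it.
- After training SlowFast, inference produced no results at all
In the DEMO section of the .yaml config file, you need to set the trained weights and the yaml that correspond to the objects you want to detect.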
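For the GLIBCXX error above, it helps to confirm which libstdc++ file actually provides the missing version before changing LD_LIBRARY_PATH. The following is a rough sketch that scans a library file for embedded `GLIBCXX_x.y.z` strings, similar in spirit to running `strings` on it and filtering the output. The function names are hypothetical, and `/path/to/lib` is the same placeholder used above.

```python
import re
from pathlib import Path

def glibcxx_versions(lib_path):
    """Return the GLIBCXX_* version strings embedded in a library file.

    The result is sorted lexicographically, not by version number.
    """
    data = Path(lib_path).read_bytes()
    found = set(re.findall(rb"GLIBCXX_\d+(?:\.\d+)*", data))
    return sorted(v.decode("ascii") for v in found)

def provides(lib_path, version="GLIBCXX_3.4.20"):
    """True if the library file contains the requested version string."""
    return version in glibcxx_versions(lib_path)
```

For example, `provides("/path/to/lib/libstdc++.so.6")` returning True would suggest that `/path/to/lib` is the directory to put at the front of LD_LIBRARY_PATH.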
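Building an AVA dataset
- Dataset format and requirements
To be updated once I have finished this part.
- The official Google AVA site does not provide video downloads; the following links do:
trainval/test: https://github.com/ItzJuny/Download-AVA_Kinetics-and-AVA_Actions/tree/master/ava-actions
This link also works and includes the follow-up processing steps: https://github.com/yjh0410/AVA_Dataset
The YOWOv3 repository has download links for both ucf24 and AVA: https://github.com/Hope1337/YOWOv3
VideoMAE-Action-Detection processes AVA somewhat differently from the others: https://github.com/MCG-NJU/VideoMAE-Action-Detection/blob/main/DATASET.md
AVA as processed for SlowFast: https://github.com/facebookresearch/video-long-term-feature-banks/blob/main/DATASET.md
Reference links on building AVA-format datasets: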
https://blog.csdn.net/WhiffeYF/article/details/124358725?spm=1001.2014.3001.5502
https://blog.csdn.net/weixin_43720054/article/details/126298006
https://github.com/Whiffe/Custom-ava-dataset_Custom-Spatio-Temporally-Action-Video-Dataset
https://blog.csdn.net/lanyan90/article/details/125796563?spm=1001.2014.3001.5502
https://blog.csdn.net/WhiffeYF/article/details/115375949?spm=1001.2014.3001.5501