Young's Blog

Train With VideoMAE

Posted 2024-03-13, updated 2024-04-09, category: methods, word count: 304, reading time ≈ 1 min

Bugs encountered while training or fine-tuning VideoMAE

Training VideoMAE fails with the following error:

  File "/home/MAE-Action-Detection/run_class_finetuning.py", line 404, in main
    train_stats = train_one_epoch(
  File "/home/MAE-Action-Detection/engine_for_finetuning.py", line 59, in train_one_epoch
    for step, (samples, boxes, _) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "/home/MAE-Action-Detection/utils.py", line 141, in log_every
    for obj in iterable:
  File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
  File "av/error.pyx", line 78, in av.error.FFmpegError.__init__
TypeError: __init__() takes at least 3 positional arguments (2 given)

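The traceback shows an av.error.FFmpegError raised inside a DataLoader worker; the TypeError appears when torch tries to re-raise it in the main process. The fix: upgrade torch to 1.13 (it was 1.9 before), together with the matching versions of torchvision and the other related packages.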
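Before re-running training, it can help to confirm which torch version the environment actually picks up. The sketch below is a minimal check, assuming the 1.13 threshold from this post; `version_tuple` and `needs_upgrade` are hypothetical helper names, not part of torch.

```python
import re


def version_tuple(version):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1).

    The local build suffix after '+' is ignored.
    """
    digits = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(d) for d in digits[:3])


def needs_upgrade(installed, minimum="1.13"):
    """Return True when the installed version is older than the minimum."""
    return version_tuple(installed) < version_tuple(minimum)


if __name__ == "__main__":
    try:
        import torch  # only needed for the live check
        print(torch.__version__, "needs upgrade:", needs_upgrade(torch.__version__))
    except ImportError:
        print("torch is not installed in this environment")
```

Running the same check against `torchvision.__version__` with its own minimum would work the same way.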

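AVA dataset format issue

Following AlphAction's format, the bbox entries in the ava_det.json file under the boxs folder default to x1,y1,w,h rather than x1,y1,x2,y2, which is why AVADataset later builds the boxes as Box(mode="xywh").convert("xyxy"). If your boxes are already x1,y1,x2,y2, that convert call is not needed; if they are in the default x1,y1,w,h, it is. This is a big pitfall, and the authors do not seem to document it anywhere.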
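To make the two layouts concrete, here is a minimal sketch of the conversion in plain Python. The function names are hypothetical, and it assumes the simple convention x2 = x1 + w; some box utilities subtract 1 to treat coordinates as inclusive pixels, so check the implementation you actually use.

```python
def xywh_to_xyxy(box):
    """Convert [x1, y1, w, h] to [x1, y1, x2, y2] (assumes x2 = x1 + w)."""
    x1, y1, w, h = box
    return [x1, y1, x1 + w, y1 + h]


def xyxy_to_xywh(box):
    """Convert [x1, y1, x2, y2] back to [x1, y1, w, h]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


if __name__ == "__main__":
    detection = [10, 20, 30, 40]  # x1, y1, w, h
    print(xywh_to_xyxy(detection))
```

If a detection file is mistakenly read in the wrong mode, widths and heights get interpreted as corner coordinates, so a quick sanity check is whether x2 > x1 and y2 > y1 hold for most boxes after loading.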

Slowfast

Posted 2024-03-11, updated 2024-09-11, category: methods, word count: 845, reading time ≈ 3 min

SlowFast Networks for Video Recognition^[1]

The authors are Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He from FAIR. Citation [1]: Feichtenhofer, Christoph et al. “SlowFast Networks for Video Recognition.” 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2018): 6201-6210.

Time

2018.Dec

Key Words

Slow pathway to capture spatial semantics
lightweight Fast pathway to capture temporal motion and fine temporal resolution
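The two pathways differ mainly in how densely they sample frames from the same clip. A minimal sketch of that sampling, assuming a Slow stride of tau = 16 and a speed ratio alpha = 8 on a 64-frame clip (these values are stated here as assumptions rather than taken from this post); `sample_pathways` is a hypothetical helper name.

```python
def sample_pathways(num_frames, tau=16, alpha=8):
    """Return frame indices for the Slow and Fast pathways.

    The Slow pathway takes one frame every `tau` frames; the Fast pathway
    samples `alpha` times more densely (stride tau // alpha).
    """
    if tau % alpha != 0:
        raise ValueError("tau must be divisible by alpha in this sketch")
    slow = list(range(0, num_frames, tau))
    fast = list(range(0, num_frames, tau // alpha))
    return slow, fast


if __name__ == "__main__":
    slow, fast = sample_pathways(64)
    print(len(slow), "slow frames;", len(fast), "fast frames")
```

With these settings the Fast pathway sees eight times as many frames as the Slow pathway, which is why it has to be kept lightweight in channel width.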

Motivation

Not all spatiotemporal orientations are equally likely, so there is no reason to treat space and time symmetrically.
Inspired by biological studies on the retinal ganglion cells in the primate visual system: roughly 80% are Parvocellular (P-cells) and roughly 20% are Magnocellular (M-cells).
- M-cells operate at high temporal frequency \(\rightarrow\) responsive to fast temporal changes
- P-cells detect spatial information: fine spatial detail and color, at lower temporal resolution

Training With Detectron2

Posted 2024-03-11, category: methods, word count: 64, reading time ≈ 1 min

Training Detectron2 on your own object detection dataset

The main step is to register your own dataset, then train on it.

from detectron2.data.datasets import register_coco_instances

register_coco_instances("train", {}, "json_annotation.json", "path/to/image/dir")
After that, it is mostly a matter of setting up the config files.
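As a rough illustration of the config step, the sketch below loads a model-zoo base config and overrides the dataset fields for the split registered above as "train". The base config name, the class count, and the iteration budget are all assumptions to adapt; `dataset_overrides` and `train` are hypothetical helper names, and detectron2 is imported lazily so the override list can be inspected without it installed.

```python
def dataset_overrides(train_name, num_classes, max_iter=300):
    """Flat key/value list in the form accepted by cfg.merge_from_list."""
    return [
        "DATASETS.TRAIN", (train_name,),
        "DATASETS.TEST", (),
        "MODEL.ROI_HEADS.NUM_CLASSES", num_classes,
        "SOLVER.MAX_ITER", max_iter,
    ]


def train(train_name="train", num_classes=1,
          base="COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"):
    """Fine-tune from a model-zoo checkpoint on a registered dataset."""
    import os
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultTrainer

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(base))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(base)  # start from COCO weights
    cfg.merge_from_list(dataset_overrides(train_name, num_classes))
    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()
```

Calling `train()` only makes sense after `register_coco_instances` has run in the same process, since the dataset name is looked up in Detectron2's catalog at training time.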

VideoMAE

Posted 2024-02-28, updated 2025-04-29, category: Papers, word count: 984, reading time ≈ 4 min

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pretraining^[1]

The authors are Zhan Tong, Yibing Song, Jue Wang, and Limin Wang (王利民), from Nanjing University, Tencent, and Shanghai AI Lab. Citation [1]: Tong, Zhan et al. “VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training.” ArXiv abs/2203.12602 (2022): n. pag.

Time

Key Words

video masked autoencoder using plain ViT backbones, tube masking with high ratio
data-efficient learner that can be trained successfully with only 3.5k videos. Data quality is more important than quantity for SSVP (self-supervised video pretraining) when a domain shift exists between the source and target datasets.
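Tube masking means the same spatial patches are hidden in every temporal slice, so a masked patch cannot be trivially recovered from a neighbouring frame. A minimal sketch, assuming 8 temporal slices of 14×14 = 196 patches and a 0.9 mask ratio (illustrative numbers, not taken from this post); `tube_mask` is a hypothetical helper name.

```python
import random


def tube_mask(num_time, num_space, ratio=0.9, seed=0):
    """Return a num_time x num_space boolean mask (True = masked).

    One spatial mask is drawn and then repeated along the time axis,
    which is what makes it a "tube".
    """
    rng = random.Random(seed)
    num_masked = int(round(ratio * num_space))
    masked = set(rng.sample(range(num_space), num_masked))
    frame_mask = [i in masked for i in range(num_space)]
    return [list(frame_mask) for _ in range(num_time)]


if __name__ == "__main__":
    mask = tube_mask(8, 196)
    print(sum(mask[0]), "of", len(mask[0]), "patches masked per slice")
```

Because every slice shares one mask, the encoder only ever sees the same small set of spatial locations across time.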

Motivation

Video transformers are usually derived from image-based transformers and rely heavily on models pre-trained on large-scale image data. Efficiently training a vanilla vision transformer on a video dataset without any pre-trained model or extra image data is a challenge.

DETR

Posted 2024-02-01, updated 2024-08-12, category: Papers, word count: 785, reading time ≈ 3 min

End-to-End Object Detection with Transformers^[1]

The authors are Nicolas Carion, Francisco Massa, and others from Facebook AI. Citation [1]: Carion, Nicolas et al. “End-to-End Object Detection with Transformers.” ArXiv abs/2005.12872 (2020): n. pag.

Key Words:

a set prediction loss (bipartite matching loss)
Transformer with parallel encoding

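Summary

Below, "we" refers to the authors.

We propose a new method that views object detection as a direct set prediction problem. It streamlines the detection pipeline and removes many hand-designed components such as non-maximum suppression (NMS) and anchor generation. The main ingredients of the new method, DEtection TRansformer (DETR), are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations between the objects and the global image context and directly outputs the final set of predictions in parallel. On the COCO object detection dataset, DETR demonstrates accuracy and run-time performance on par with Faster R-CNN. DETR can also be easily generalized to produce panoptic segmentation in a unified manner.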
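The bipartite matching step can be illustrated on a tiny cost matrix. The sketch below finds the minimum-cost one-to-one assignment by brute force over permutations, which is a stand-in chosen for readability and only practical for a handful of boxes; a real implementation would use a Hungarian-algorithm solver. The cost values and the `match` helper are made up for illustration.

```python
from itertools import permutations


def match(cost):
    """Minimum-cost one-to-one matching of predictions to ground truths.

    cost[i][j] is the cost of assigning prediction i to ground truth j.
    Requires at least as many predictions as ground truths; unmatched
    predictions would be treated as "no object".
    """
    num_pred, num_gt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for preds in permutations(range(num_pred), num_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_total:
            best, best_total = preds, total
    return {p: g for g, p in enumerate(best)}, best_total


if __name__ == "__main__":
    cost = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # 3 predictions, 2 ground truths
    assignment, total = match(cost)
    print(assignment, round(total, 3))
```

Each ground-truth box ends up with exactly one prediction, which is what lets the loss penalize duplicates without an NMS step.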


SSD

Posted 2024-02-01, updated 2024-03-03, category: Papers, word count: 152, reading time ≈ 1 min

SSD: Single Shot MultiBox Detector^[1]

The authors are Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, from UNC Chapel Hill, Zoox Inc, Google, and UMichigan. Citation [1]: Liu, W. et al. “SSD: Single Shot MultiBox Detector.” European Conference on Computer Vision (2015).

Key Words

discretize output space of bboxes into a set of default boxes over different aspect ratios and scales.
combines predictions from multiple feature maps with different resolutions to handle objects of various sizes
multi-scale conv bbox outputs attached to multiple feature maps at the top of the network
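The default boxes can be sketched as a set of (width, height) shapes per feature map. The snippet below assumes scales spaced linearly between s_min = 0.2 and s_max = 0.9 across the feature maps and box shapes of (s·sqrt(a), s/sqrt(a)) for aspect ratio a; these constants and the helper names are stated here as assumptions, not taken from this post.

```python
import math


def scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales, one per feature map (relative to image size)."""
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) of each default box at one scale; area stays scale**2."""
    return [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]


if __name__ == "__main__":
    for s in scales(6):
        shapes = [(round(w, 3), round(h, 3)) for w, h in default_box_shapes(s)]
        print(round(s, 2), shapes)
```

Lower-resolution feature maps get the larger scales, so large objects are matched by coarse maps and small objects by fine ones.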

MAE

Posted 2024-02-01, updated 2025-04-29, category: Papers, word count: 1.7k, reading time ≈ 6 min

Masked Autoencoders Are Scalable Vision Learners^[1]

The authors are Kaiming He, Xinlei Chen, Saining Xie, and others from FAIR. Citation [1]: He, Kaiming et al. “Masked Autoencoders Are Scalable Vision Learners.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021): 15979-15988.

Below, "we" refers to the authors.

Time

2021.Nov

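Abstract

MAE: masked autoencoders are scalable self-supervised learners. The idea is to randomly mask patches of the input image and then reconstruct the missing pixels. Two core designs:
- An asymmetric encoder-decoder architecture: the encoder operates only on the visible subset of patches, and a lightweight decoder reconstructs the original image from the latent representation and mask tokens.
- Masking a high proportion of the input image, e.g. 75%, yields a nontrivial and meaningful self-supervisory task.
Coupling the two designs makes it possible to train large models efficiently and effectively. This scalable approach allows learning high-capacity models that generalize well: a vanilla ViT-Huge model reaches a best accuracy of 87.8% on ImageNet-1K. Transfer performance on downstream tasks surpasses supervised pre-training and shows promising scaling behavior.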
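The masking step described above can be sketched in a few lines. This assumes a 14×14 grid of 196 patches (an illustrative number) and the 75% ratio mentioned in the abstract; `random_masking` is a hypothetical helper name, and only the visible indices would be fed to the encoder.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into a visible subset and a masked subset.

    A random permutation is drawn; the first (1 - mask_ratio) share of it
    stays visible and the rest is masked.
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


if __name__ == "__main__":
    visible, masked = random_masking(196)
    print(len(visible), "visible patches;", len(masked), "masked patches")
```

Since the encoder processes only a quarter of the patches, most of the compute saving comes from this split rather than from the decoder being small.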

Summary:


VideoMAEv2

Posted 2024-01-26, updated 2024-04-09, category: Papers, word count: 843, reading time ≈ 3 min

VideoMAEv2: Scaling Video Masked Autoencoders with Dual Masking^[1] 🎞️

The authors are a team from the Novel Software Technology Lab at Nanjing University, Shanghai AI Lab, and the Shenzhen Institute of Advanced Technology. Source [1]: Wang, Limin, et al. "Videomae v2: Scaling video masked autoencoders with dual masking." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Summary:

Below, "we" refers to the authors.


YOLO

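Posted 2024-01-24, updated 2024-03-01, category: Papers, word count: 631, reading time ≈ 2 min

The YOLO paper series

A few words off topic first: over the past few days I decided to use this blog to keep notes on the papers I read, as a way of holding myself accountable. AI is moving fast, especially since ChatGPT came out, and there are so many new papers that I don't know where to start; it is a bit dizzying. After thinking it over, I will take it one step at a time, starting with the classic papers while also reading the new, high-profile ones, and make progress little by little. Without accumulating small steps, one cannot travel a thousand miles; a journey of a thousand miles begins with a single step. Keep going!!! As long as you want to do it, it is never too late!! 🏃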

You Only Look Once: Unified, Real-Time Object Detection^[1]🚀

The authors are from the University of Washington, the Allen Institute for AI, and FAIR, and include Joseph Redmon, Santosh Divvala, Ross Girshick, and others. Source: [1] Redmon, Joseph et al. “You Only Look Once: Unified, Real-Time Object Detection.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015): 779-788.

Summary:


Hello World

Posted 2024-01-24, updated 2023-11-14, word count: 79, reading time ≈ 1 min

Welcome to Hexo! This is your very first post. Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.
