BEiT
BEiT: BERT Pre-Training of Image Transformers[1]
The authors are Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei, from Harbin Institute of Technology and Microsoft. Citation: Bao, Hangbo et al. "BEiT: BERT Pre-Training of Image Transformers." arXiv abs/2106.08254 (2021).
Time
- June 2021
Key Words
- Self-supervised vision representation model: BEiT
- Pre-training task: masked image modeling (MIM)
- Two views of image representation: image patches (input) and visual tokens (output)
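To make the two views concrete, here is a minimal PyTorch sketch, assuming a 224x224 RGB image split into 16x16 patches (14x14 = 196 patches) as in the paper; `tokenizer` is a stand-in for the frozen dVAE tokenizer that BEiT borrows from DALL-E.

```python
import torch

def to_patches(image: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Input view: flatten an image into a sequence of patch vectors,
    which a linear projection then embeds for the Transformer."""
    c, _, _ = image.shape
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # (C, H/P, W/P, P, P) -> (num_patches, C*P*P)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)

image = torch.randn(3, 224, 224)
patches = to_patches(image)      # (196, 768): the input view
# Output view (hypothetical call): the frozen tokenizer maps the same
# image to 196 discrete visual tokens from an 8192-entry codebook.
# visual_tokens = tokenizer(image)   # shape (196,), ints in [0, 8192)
print(patches.shape)
```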
Problem Addressed
Directly applying BERT-style pre-training to image data is challenging:
- There is no pre-existing vocabulary for ViT's input units, i.e., image patches, so we cannot simply employ a softmax classifier to predict over all possible candidates for the masked patches (see the sketch after this list).
- A straightforward alternative is to treat the task as regression, predicting the raw pixels of the masked patches; however, such pixel-level recovery tends to waste modeling capability on short-range dependencies and high-frequency details.
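BEiT resolves the vocabulary problem by tokenizing the image with a separately trained discrete VAE (the dVAE from DALL-E), whose 8192-entry codebook plays the role that BERT's word vocabulary plays in language. Below is a minimal PyTorch sketch of the resulting MIM objective under those assumptions; `hidden`, `mask`, and `visual_tokens` are illustrative names, and the roughly 40% blockwise masking ratio follows the paper.

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim, num_patches = 8192, 768, 196

# Softmax classifier over the visual-token vocabulary -- possible only
# because the dVAE tokenizer supplies a discrete target for every patch.
mim_head = nn.Linear(hidden_dim, vocab_size)

hidden = torch.randn(num_patches, hidden_dim)                 # encoder output per patch position
visual_tokens = torch.randint(0, vocab_size, (num_patches,))  # tokenizer targets (stand-in)

mask = torch.zeros(num_patches, dtype=torch.bool)
mask[torch.randperm(num_patches)[:75]] = True                 # ~40% of patches masked, as in BEiT

# The loss is computed only at masked positions: predict each corrupted
# patch's visual token, not its raw pixels, sidestepping the
# pixel-regression shortcomings noted above.
logits = mim_head(hidden[mask])
loss = nn.functional.cross_entropy(logits, visual_tokens[mask])
print(loss.item())
```

This is the same cross-entropy formulation BERT uses for masked language modeling, with visual tokens standing in for subword tokens.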