Unconditional Generation
Return of Unconditional Generation: A Self-supervised Representation Generation Method[1]
The authors are Tianhong Li, Dina Katabi, and Kaiming He from MIT CSAIL. Citation: Li, Tianhong et al. “Return of Unconditional Generation: A Self-supervised Representation Generation Method.” (2023).
The aim is to understand the essential semantic information of object images instead of staying at the level of image patterns and surface features, so as to improve generalization: understand and learn feature representations from small to large, from fine detail to the broad picture, from local to global.
Key Words
- unconditional generation with unlabeled data.
- self-supervised encoder: MoCo v3 ViT-B
- Representation Generation: RDM 12-block, 1536-hid-dim for 100 epochs
- Image generation: MAGE-B for 200 epochs
- Representation-Conditioned Generation (RCG)
- generate semantic representations in the representation space
Summary
Generative models have developed for a long time as an unsupervised method; important works include GANs, VAEs, and diffusion models. These foundational methods focus on the probability distribution of the data and do not depend on the availability of human annotations. The problem is often categorized as unconditional generation, which aims to leverage large amounts of unlabeled data to learn complex distributions. Closing the gap between conditional and unconditional generation is a valuable problem, and unleashing the power of large-scale unlabeled data is a necessary step.
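Below is a minimal sketch of how I understand the RCG pipeline from the key words above (the class and function names are my own placeholders, not the authors' code): a frozen self-supervised encoder such as MoCo v3 ViT-B embeds images into a representation space, an RDM learns to sample representations from noise, and a pixel generator such as MAGE is conditioned on the sampled representation.

```python
# Hypothetical sketch of the RCG three-stage pipeline (placeholder modules,
# not the authors' code): frozen self-supervised encoder -> representation
# diffusion model (RDM) -> representation-conditioned pixel generator (MAGE).
import torch
import torch.nn as nn


class RDM(nn.Module):
    """Toy stand-in for the representation diffusion model: an MLP denoiser."""

    def __init__(self, rep_dim=256, hidden_dim=1536, num_blocks=12):
        super().__init__()
        dims = [rep_dim] + [hidden_dim] * num_blocks
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.SiLU()]
        layers.append(nn.Linear(hidden_dim, rep_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, noisy_rep, t):
        # A real RDM also embeds the diffusion timestep t; omitted here.
        return self.net(noisy_rep)


@torch.no_grad()
def rcg_sample(rdm, pixel_generator, n=4, rep_dim=256, steps=50):
    """Unconditional generation: sample a representation, then decode pixels."""
    rep = torch.randn(n, rep_dim)        # start from noise in representation space
    for t in reversed(range(steps)):
        rep = rdm(rep, t)                # grossly simplified reverse diffusion step
    return pixel_generator(rep)          # MAGE-like generator conditioned on rep


# Dummy usage showing the data flow (a linear layer stands in for MAGE):
rdm = RDM()
pixel_generator = nn.Linear(256, 3 * 32 * 32)
images = rcg_sample(rdm, pixel_generator).view(-1, 3, 32, 32)
print(images.shape)  # torch.Size([4, 3, 32, 32])
```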
DDDM
Deconstructing Denoising Diffusion Models for Self-Supervised Learning[1]
The authors are Xinlei Chen, Zhuang Liu, Saining Xie, and Kaiming He, from FAIR and NYU. Citation: Chen, Xinlei et al. “Deconstructing Denoising Diffusion Models for Self-Supervised Learning.” ArXiv abs/2401.14404 (2024): n. pag.
Key Words
- Denoising Diffusion Models
- Denoising Autoencoder
- low-dimensional latent space
Summary
- Denoising is at the core of current generative models such as DDMs. These models generate very well and appear to learn representations of visual content. Two questions arise:
- Current studies of DDMs' representation ability use off-the-shelf pre-trained DDMs, which were originally built for generation, and evaluate their representations for recognition;
- It is unclear whether the representation ability comes from the denoising-driven process or from the diffusion-driven process.
- The paper's approach is to deconstruct a DDM, step by step turning it into a classical DAE, and to examine each of its components along the way. The main finding is that the key component is the tokenizer, which creates a low-dimensional latent space; the role of using multiple levels of noise is analogous to a form of data augmentation.
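A toy sketch of where the deconstruction ends up, as I read it (my own simplification, not the authors' l-DAE): a PCA-like linear tokenizer projects images into a low-dimensional latent space, Gaussian noise is added there, and a small network is trained to regress the clean latent.

```python
# Toy sketch of a "classical DAE operating in a low-dimensional latent space"
# (my simplification of the paper's conclusion, not the authors' l-DAE).
import torch
import torch.nn as nn

latent_dim = 16
# PCA-like linear tokenizer that creates the low-dimensional latent space.
tokenizer = nn.Linear(3 * 32 * 32, latent_dim, bias=False)
denoiser = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                         nn.Linear(128, latent_dim))

def dae_loss(images, sigma=0.5):
    """Add Gaussian noise in the latent space and regress the clean latent."""
    z = tokenizer(images.flatten(1))            # encode to the low-dim latent
    noisy_z = z + sigma * torch.randn_like(z)   # a single noise level = a DAE;
                                                # multiple levels would act like
                                                # data augmentation, per the paper
    return nn.functional.mse_loss(denoiser(noisy_z), z)

loss = dae_loss(torch.randn(8, 3, 32, 32))
loss.backward()
print(loss.item())
```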
A passing thought
Is there a tool that can visualize how the functions in a codebase execute and call each other? When reading large projects, the call chains feel messy, and the relationships are hard to remember and untangle.
Dual viewpoints, 3D trajectories
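As one low-effort starting point for the call-visualization question above, Python's standard-library cProfile/pstats already records caller/callee relationships; the sketch below only shows that built-in route (dedicated call-graph visualizers exist as well), with `helper` and `main` as toy placeholders for real project code.

```python
# Sketch: profile a run with the standard library and inspect caller/callee
# relationships; the functions here are toy placeholders.
import cProfile
import pstats


def helper(x):
    return sum(i * i for i in range(x))


def main():
    return [helper(n) for n in range(1000)]


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)   # hottest call paths
stats.print_callees("main")                     # which functions main() calls
```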
BEiT
BEiT: BERT Pre-Training of Image Transformers[1]
The authors are Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei from Harbin Institute of Technology and Microsoft. Citation: Bao, Hangbo et al. “BEiT: BERT Pre-Training of Image Transformers.” ArXiv abs/2106.08254 (2021): n. pag.
Time
- 2021.Jun
Key Words
- Self-supervised vision representation model: BEiT
- pre-training task: masked image modeling (MIM)
- two views of image representation: image patches (input) and visual tokens (output)
Problem addressed
Directly applying the BERT recipe to image data is challenging:
- There is no pre-existing vocabulary for ViT's input units, i.e. image patches, so one cannot simply use a softmax classifier to predict over all possible candidates for the masked patches.
- A straightforward alternative is to treat the task as regression and predict the raw pixels of the masked patches, but such a pixel-level recovery task tends to waste modeling capability on pre-training short-range dependencies and high-frequency details. BEiT instead predicts discrete visual tokens, as illustrated in the sketch below.
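A toy sketch of the masked-image-modeling objective with visual tokens (my own simplification, not the BEiT code; the shapes and the dVAE tokenizer that produces the token ids are assumed): mask a subset of patch embeddings and classify each masked position over the visual-token vocabulary instead of regressing pixels.

```python
# Toy sketch of masked image modeling with visual tokens (not the BEiT code):
# the two views are image patches (input) and visual tokens (prediction target).
import torch
import torch.nn as nn

patch_size, num_patches, vocab_size, dim = 16, 196, 8192, 768

patch_embed = nn.Linear(3 * patch_size * patch_size, dim)   # patches -> embeddings
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2)
mim_head = nn.Linear(dim, vocab_size)            # softmax over the token vocabulary
mask_token = nn.Parameter(torch.zeros(1, 1, dim))

def mim_loss(patches, visual_tokens, mask_ratio=0.4):
    """patches: (B, N, 3*P*P); visual_tokens: (B, N) ids from a dVAE tokenizer."""
    x = patch_embed(patches)
    mask = torch.rand(x.shape[:2]) < mask_ratio                 # which patches to hide
    x = torch.where(mask.unsqueeze(-1), mask_token.expand_as(x), x)
    logits = mim_head(encoder(x))                               # (B, N, vocab_size)
    # Classification loss only on the masked positions, not pixel regression.
    return nn.functional.cross_entropy(logits[mask], visual_tokens[mask])

loss = mim_loss(torch.randn(2, num_patches, 3 * patch_size * patch_size),
                torch.randint(0, vocab_size, (2, num_patches)))
loss.backward()
```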
MAE as Spatiotemporal Learner
Masked Autoencoders As Spatiotemporal Learners[1]
The authors are Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, and Kaiming He from FAIR. Citation [1]: Feichtenhofer, Christoph et al. “Masked Autoencoders As Spatiotemporal Learners.” ArXiv abs/2205.09113 (2022): n. pag.
Key Words
- extension of MAE to video (see the masking sketch below)
- minimal domain knowledge
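A toy sketch of the core ingredient, random masking of space-time patches with a very high ratio (90% in the paper); this is my own code under an assumed patch layout, not the paper's implementation, and the encoder would only see the visible subset.

```python
# Toy sketch of random space-time patch masking for video MAE (my own code,
# not the paper's implementation). The encoder only processes the visible subset.
import torch

def random_spacetime_masking(video_patches, mask_ratio=0.9):
    """video_patches: (B, N, D) with N = T' * H' * W' space-time patches."""
    B, N, D = video_patches.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]   # keep the lowest-scoring patches
    visible = torch.gather(video_patches, 1,
                           keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx                        # decoder later fills in the rest

patches = torch.randn(2, 8 * 14 * 14, 768)          # e.g. 8x14x14 patches of dim 768
visible, keep_idx = random_spacetime_masking(patches)
print(visible.shape)                                # torch.Size([2, 156, 768])
```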
Train With VideoMAE
Bugs encountered when training or fine-tuning VideoMAE
- Error when training VideoMAE:
  File "/home/MAE-Action-Detection/run_class_finetuning.py", line 404, in main
    train_stats = train_one_epoch(
  File "/home/MAE-Action-Detection/engine_for_finetuning.py", line 59, in train_one_epoch
    for step, (samples, boxes, _) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
  File "/home/MAE-Action-Detection/utils.py", line 141, in log_every
    for obj in iterable:
  File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
  File "av/error.pyx", line 78, in av.error.FFmpegError.__init__
  TypeError: __init__() takes at least 3 positional arguments (2 given)
The fix: upgrade torch, torchvision, and related packages to the matching 1.13 releases (they were 1.9 before).
- Issues with the AVA dataset format
Follow AlphAction's format. In the boxs folder, the bbox format in ava_det.json defaults to x1, y1, w, h rather than x1, y1, x2, y2, so there is a spot later in AVADataset along the lines of Box(mode="xyxy").convert("xywh"). If your bboxes are already in x1, y1, x2, y2, the convert("xywh") call is not needed; if they are in the default x1, y1, w, h, it is needed. This is a big pitfall, and the authors do not seem to mention it anywhere. The two conventions are contrasted below.
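For reference, here is a plain illustration of the two box conventions (generic helper functions of my own, not the AlphAction BoxList API):

```python
# Plain illustration of the two bbox conventions (generic helpers, not the
# AlphAction BoxList API): xywh is x1, y1, w, h; xyxy is x1, y1, x2, y2.
def xywh_to_xyxy(box):
    x1, y1, w, h = box
    return [x1, y1, x1 + w, y1 + h]

def xyxy_to_xywh(box):
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

print(xywh_to_xyxy([10, 20, 30, 40]))   # [10, 20, 40, 60]
print(xyxy_to_xywh([10, 20, 40, 60]))   # [10, 20, 30, 40]
```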
SlowFast
SlowFast Networks for Video Recognition[1]
The authors are Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He from FAIR. Citation [1]: Feichtenhofer, Christoph et al. “SlowFast Networks for Video Recognition.” 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2018): 6201-6210.
Time
- 2018.Dec
Key Words
- Slow pathway to capture spatial semantics
- lightweight Fast pathway to capture temporal motion and fine temporal resolution
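A toy sketch of how the two pathways sample the same clip (my own simplification of the idea, not the SlowFast code; alpha=8 is the typical frame-rate ratio from the paper): the Fast pathway keeps every frame, the Slow pathway keeps every alpha-th frame.

```python
# Toy sketch of two-pathway frame sampling (my simplification, not the SlowFast
# code). alpha is the frame-rate ratio between the Fast and Slow pathways.
import torch

def make_slowfast_inputs(clip, alpha=8):
    """clip: (C, T, H, W). Fast keeps all T frames, Slow keeps every alpha-th."""
    fast = clip                   # fine temporal resolution, lightweight channels
    slow = clip[:, ::alpha]       # low frame rate, focuses on spatial semantics
    return slow, fast

clip = torch.randn(3, 32, 224, 224)    # a 32-frame RGB clip
slow, fast = make_slowfast_inputs(clip)
print(slow.shape, fast.shape)   # torch.Size([3, 4, 224, 224]) torch.Size([3, 32, 224, 224])
```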
Motivation
- all spatiotemporal orientations are not equally likely, so there is no reason for us to treat space and time symmetrically.
- Inspired by biological studies on retinal ganglion cells in the primate visual system: roughly 80% are Parvocellular (P-cells) and roughly 20% are Magnocellular (M-cells).
- M-cells operate at high temporal frequency \(\rightarrow\) sensitive to fast temporal changes
- P-cells capture spatial information: spatial detail and color, at lower temporal resolution
Training With Detectron2
Training your own object detection dataset with Detectron2
The main steps are to register your own dataset and then train on it:
from detectron2.data.datasets import register_coco_instances
register_coco_instances("train", {}, "json_annotation.json", "path/to/image/dir")
After that, it is mostly a matter of setting up the config and launching training, as sketched below.
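A minimal sketch of that step using the standard Detectron2 config/trainer flow; the model-zoo config file, class count, and solver settings below are placeholders to adapt to your own dataset.

```python
# Minimal sketch of training on the registered "train" dataset with Detectron2.
# The model-zoo config, class count, and solver settings are placeholders.
import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("train",)          # the dataset name registered above
cfg.DATASETS.TEST = ()
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1      # set to the number of classes in your data
cfg.DATALOADER.NUM_WORKERS = 2
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 3000
cfg.OUTPUT_DIR = "./output"
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```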
VideoMAE
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pretraining[1]
The authors are Zhan Tong, Yibing Song, Jue Wang, and Limin Wang, from Nanjing University, Tencent, and Shanghai AI Lab. Citation [1]: Tong, Zhan et al. “VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training.” ArXiv abs/2203.12602 (2022): n. pag.
Time
- 2022.Mar
Key Words
- video masked autoencoder using plain ViT backbones; tube masking with a high ratio (see the sketch below)
- data-efficient learner that can be successfully trained with only 3.5k videos. Data quality matters more than quantity for SSVP (self-supervised video pre-training) when a domain shift exists between the source and target datasets.
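A toy sketch of tube masking (my own code, not the VideoMAE implementation): one spatial mask is sampled and repeated across every frame, so a masked patch location stays hidden for the whole clip, which keeps the model from trivially copying it from a neighboring frame.

```python
# Toy sketch of tube masking with a high ratio (my own code, not the VideoMAE
# implementation): sample ONE spatial mask and repeat it over time, so the same
# patch location is hidden in every frame of the clip.
import torch

def tube_mask(num_frames, num_spatial_patches, mask_ratio=0.9):
    """Returns a boolean mask of shape (num_frames, num_spatial_patches)."""
    num_masked = int(num_spatial_patches * mask_ratio)
    scores = torch.rand(num_spatial_patches)
    spatial_mask = torch.zeros(num_spatial_patches, dtype=torch.bool)
    spatial_mask[scores.argsort()[:num_masked]] = True       # mask these patches...
    return spatial_mask.unsqueeze(0).expand(num_frames, -1)  # ...in every frame

mask = tube_mask(num_frames=8, num_spatial_patches=14 * 14)
print(mask.shape, mask.float().mean().item())   # torch.Size([8, 196]), ~0.9 masked
```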
Motivation
Video transformers are usually derived from image-based transformers and rely heavily on models pre-trained on large-scale image data; efficiently training a vanilla vision transformer on a video dataset without any pre-trained model or extra image data is a challenge.