Extracting and Composing Robust Features with Denoising Autoencoders[1]

This paper was published in 2008 by a team from the Université de Montréal; the authors are Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Citation [1]: Vincent, Pascal et al. “Extracting and composing robust features with denoising autoencoders.” International Conference on Machine Learning (2008).

Time

  • 2008.Feb

Key Words

Summary

  1. The difficulty of training deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps the input to useful intermediate representations. The authors propose a new way of learning representations without supervision, based on making the learned representations robust to partial corruption of the input pattern (a minimal sketch of this corrupt-then-reconstruct objective follows this list).

  2. Each layer produces a representation of the input pattern that is more abstract than the previous layer's, because it is obtained by composing more operations.

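A minimal sketch of the corrupt-then-reconstruct objective, written in PyTorch with assumed layer sizes; the corruption zeroes a random fraction of input components, and the reconstruction loss is measured against the clean input:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """One-layer DAE: encode a corrupted input, decode back to the clean input."""
    def __init__(self, in_dim=784, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())

    def forward(self, x, corruption_level=0.25):
        # Partial corruption: randomly zero out a fraction of the input components.
        mask = (torch.rand_like(x) > corruption_level).float()
        h = self.encoder(x * mask)          # representation learned to be robust to corruption
        x_hat = self.decoder(h)             # reconstruction of the *uncorrupted* input
        return x_hat, h

dae = DenoisingAutoencoder()
x = torch.rand(32, 784)                     # e.g. a batch of flattened images in [0, 1]
x_hat, _ = dae(x)
loss = nn.functional.binary_cross_entropy(x_hat, x)   # compare against the clean x
loss.backward()
```

Stacking such layers, each trained on the representation produced by the previous one, gives the increasingly abstract representations mentioned in point 2.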

Emerging Properties in Self-Supervised Vision Transformers[1]

The authors are a team from FAIR, Inria, and Sorbonne University. Citation [1]: Caron, Mathilde et al. “Emerging Properties in Self-Supervised Vision Transformers.” 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 9630-9640.

Time

  • 2021.Apr

Motivation

  1. Is the success of Transformers in vision due to the supervision used during pretraining? One of the main ingredients of the Transformer's success in NLP is self-supervised pretraining.
  2. The authors therefore study self-supervised pretraining of ViT features.

Key Words

  • Self-supervised ViT features
  • self-distillation with no labels (DINO)

Summary

  1. Properties of self-supervised pretraining on ViTs that do not emerge with supervised ViTs:
    • The features explicitly contain the scene layout and, in particular, object boundaries; this information is mostly accessible in the self-attention modules of the last block.
    • The features of a self-supervised ViT reach 78.3% top-1 accuracy on ImageNet with a basic k-NN classifier, without any finetuning, linear classifier, or data augmentation.
  2. This strong k-NN performance only emerges when combined with a momentum encoder and multi-crop augmentation; using smaller patches with ViTs also improves the quality of the resulting features (a sketch of the self-distillation update is given after this list).
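A minimal sketch of the DINO-style self-distillation step on two augmented views, assuming generic student/teacher networks that output one logit vector per image; the temperatures and momentum values below are illustrative, not the paper's exact schedule:

```python
import torch
import torch.nn.functional as F

def dino_step(student, teacher, center, view1, view2,
              t_s=0.1, t_t=0.04, m_teacher=0.996, m_center=0.9):
    """One self-distillation step: match student outputs to sharpened, centered teacher outputs."""
    s_out = torch.stack([student(view1), student(view2)])           # (2, B, K)
    with torch.no_grad():
        t_out = torch.stack([teacher(view1), teacher(view2)])       # teacher gets no gradients
        t_prob = F.softmax((t_out - center) / t_t, dim=-1)          # centering + sharpening

    # Cross-entropy between teacher and student, only across *different* views.
    loss = 0.0
    for i in range(2):
        for j in range(2):
            if i != j:
                loss = loss - (t_prob[i] * F.log_softmax(s_out[j] / t_s, dim=-1)).sum(-1).mean()
    loss = loss / 2

    with torch.no_grad():
        # EMA updates (in the real recipe these happen after the student optimizer step).
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(m_teacher).add_(p_s, alpha=1 - m_teacher)
        center.mul_(m_center).add_(t_out.mean(dim=(0, 1)), alpha=1 - m_center)
    return loss
```

The k-NN evaluation then simply freezes the pretrained backbone and classifies each validation image by the labels of its nearest training features.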

Return of Unconditional Generation: A Self-supervised Representation Generation Method[1]

The authors are Tianhong Li, Dina Katabi, and Kaiming He from MIT CSAIL. Citation [1]: Li, Tianhong et al. “Return of Unconditional Generation: A Self-supervised Representation Generation Method.” (2023).

Understand the essential semantic information of objects in images, rather than stopping at surface patterns and low-level features, in order to improve generalization: understand and learn feature representations from small to large, from fine detail to the broad picture, from the local to the whole.

Key Words

  • unconditional generation with unlabeled data.
  • self-supervised encoder: MoCov3 ViT-B
  • Representation Generation: RDM 12-block, 1536-hid-dim for 100 epochs
  • Image generation: MAGE-B for 200 epochs
  • Representation-Conditioned Generation(RCG)
  • generate semantic representations in the representation space

Summary

  1. Generative models have long been developed as unsupervised methods, with landmark works such as GANs, VAEs, and diffusion models. These foundational methods focus on the probability distribution of the data and do not depend on the availability of human annotations. The problem is usually framed as unconditional generation: learning complex data distributions from large amounts of unlabeled data. Closing the gap between conditional and unconditional generation is a valuable problem, and unleashing the power of large-scale unlabeled data is a necessary step (the overall RCG pipeline is sketched below).

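A high-level sketch of the RCG pipeline implied by the key words above. `encoder`, `rdm`, and `mage` are hypothetical stand-ins supplied by the caller (in the paper they correspond to a frozen MoCo v3 ViT-B, a representation diffusion model, and MAGE-B); the `.fit`/`.sample` methods are illustrative, not the released API:

```python
def train_rcg(unlabeled_images, encoder, rdm, mage):
    # Stage 1: a frozen self-supervised encoder maps images into a compact representation space.
    reps = [encoder(img) for img in unlabeled_images]

    # Stage 2: a representation diffusion model (RDM) learns the distribution of those
    # representations, so semantic representations can later be sampled unconditionally.
    rdm.fit(reps)

    # Stage 3: a pixel generator (MAGE) learns to generate images conditioned on the
    # representation of each image. No human labels are used anywhere.
    mage.fit(unlabeled_images, conditions=reps)

def sample_rcg(rdm, mage, n):
    # Unconditional generation: first sample a semantic representation, then decode it to pixels.
    return [mage.sample(condition=rdm.sample()) for _ in range(n)]
```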

Deconstructing Denoising Diffusion Models for Self-supervised Learning[1]

The authors are Xinlei Chen, Zhuang Liu, Saining Xie, and Kaiming He, from FAIR and NYU. Citation [1]: Chen, Xinlei et al. “Deconstructing Denoising Diffusion Models for Self-Supervised Learning.” ArXiv abs/2401.14404 (2024): n. pag.

Key Words

  • Denoising Diffusion Models
  • Denoising Autoencoder
  • low-dimensional latent space

Summary

  1. Denoising is at the core of today's generative models, e.g., denoising diffusion models (DDMs). These models generate very well and appear to learn representations of visual content. Two open questions:
    • Current studies of the representation ability of DDMs use off-the-shelf pretrained DDMs that were built for generation and evaluate those representations for recognition;
    • It is unclear whether the representation ability comes from the denoising-driven process or from the diffusion-driven process.
  2. The approach of the paper: deconstruct a DDM, step by step turning it into a classical DAE, and examine each component along the way. The main finding is that the critical component is the tokenizer, which creates a low-dimensional latent space; the role of using multiple levels of noise is analogous to a form of data augmentation (a minimal latent-denoising sketch follows below).
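A minimal sketch of the deconstructed recipe: project patches into a low-dimensional latent space with a simple PCA-like tokenizer, add Gaussian noise at a randomly chosen level per sample, and train a plain autoencoder to denoise. Shapes and the noise range here are assumptions for illustration, not the paper's exact l-DAE configuration:

```python
import torch

def pca_tokenizer(patches, d=16):
    """Fit a simple PCA 'tokenizer' projecting flattened patches into a d-dimensional latent space."""
    mean = patches.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(patches - mean, q=d)
    return mean, v                                    # v: (patch_dim, d) projection basis

def latent_denoising_pairs(patches, mean, v, sigma_range=(0.1, 1.0)):
    """Build (noisy latent, clean latent) pairs; varying noise levels act like data augmentation."""
    z = (patches - mean) @ v                          # clean low-dimensional latents
    sigma = torch.empty(z.size(0), 1).uniform_(*sigma_range)   # one noise level per sample
    z_noisy = z + sigma * torch.randn_like(z)
    return z_noisy, z                                 # a DAE is then trained to map z_noisy -> z

patches = torch.randn(1024, 3 * 16 * 16)              # e.g. flattened 16x16 RGB patches
mean, v = pca_tokenizer(patches)
z_noisy, z = latent_denoising_pairs(patches, mean, v)
```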

Is there a tool that can visualize how code functions execute and call each other? When reading large projects, the function call relationships look messy and are hard to remember and untangle.

Dual view, 3D trajectory

BEiT: BERT Pre-Training of Image Transformers[1]

The authors are Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei from Harbin Institute of Technology and Microsoft. Citation [1]: Bao, Hangbo et al. “BEiT: BERT Pre-Training of Image Transformers.” ArXiv abs/2106.08254 (2021): n. pag.

Time

  • 2021.Jun

Key Words

  • Self-supervised vision representation model:BEiT
  • pre-training task: masked image modeling(MIM)
  • two views of image representation: image patches(input) and visual tokens(output)

Problems Addressed

  1. Directly applying the BERT recipe to image data is challenging:

    • There is no pre-existing vocabulary for ViT's input unit, i.e., image patches, so one cannot simply use a softmax classifier to predict over all possible candidates for the masked patches.
    • A straightforward alternative is to treat the task as regression and predict the raw pixels of the masked patches, but such a pixel-level recovery task tends to waste modeling capability on pretraining short-range dependencies and high-frequency details. (A minimal sketch of the masked-image-modeling objective over visual tokens follows below.)
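A minimal sketch of the masked image modeling (MIM) objective, assuming a pretrained discrete tokenizer (the dVAE) that maps an image to a grid of visual-token ids and a ViT backbone that substitutes a mask embedding for masked patches internally; `tokenizer`, `backbone`, and `mim_head` are placeholders supplied by the caller, and the random masking below simplifies BEiT's blockwise masking:

```python
import torch
import torch.nn.functional as F

def mim_loss(images, tokenizer, backbone, mim_head, mask_ratio=0.4):
    """BEiT-style MIM: predict the discrete visual token of every masked patch."""
    with torch.no_grad():
        token_ids = tokenizer(images)                 # (B, N) ids from the visual-token vocabulary

    B, N = token_ids.shape
    mask = torch.rand(B, N, device=token_ids.device) < mask_ratio   # which patches are masked

    patch_features = backbone(images, mask)           # (B, N, D), masked patches replaced internally
    logits = mim_head(patch_features)                 # (B, N, vocab_size): softmax over visual tokens

    # The loss is computed only on the masked positions, as in BERT.
    return F.cross_entropy(logits[mask], token_ids[mask])
```

The visual tokens supply exactly the missing "vocabulary" from the first bullet, so the model can use a softmax classifier instead of regressing raw pixels.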

Masked Autoencoders As Spatiotemporal Learners[1]

The authors are Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, and Kaiming He from FAIR. Citation [1]: Feichtenhofer, Christoph et al. “Masked Autoencoders As Spatiotemporal Learners.” ArXiv abs/2205.09113 (2022): n. pag.

Key Words

  • extension of MAE to video

  • minimal domain knowledge (a patchify-and-mask sketch is given below)

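A minimal sketch of the video-specific part the keywords point to: cutting a clip into spacetime patches and randomly masking a very high fraction of them (the 90% ratio and 2x16x16 patch size below are assumptions in the spirit of the paper):

```python
import torch

def spacetime_patchify(clip, pt=2, ph=16, pw=16):
    """Split a clip of shape (C, T, H, W) into flattened spacetime patches of size pt x ph x pw."""
    C, T, H, W = clip.shape
    patches = clip.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    patches = patches.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, C * pt * ph * pw)
    return patches                                    # (num_patches, patch_dim)

def random_masking(patches, mask_ratio=0.9):
    """Keep a random 10% of the spacetime patches; only these are fed to the encoder."""
    n = patches.size(0)
    keep = torch.randperm(n)[: int(n * (1 - mask_ratio))]
    return patches[keep], keep

clip = torch.randn(3, 16, 224, 224)                   # 16 RGB frames
patches = spacetime_patchify(clip)                    # 8 * 14 * 14 = 1568 patches
visible, keep_idx = random_masking(patches)           # ~157 visible patches go to the encoder
```

Beyond this agnostic masking, almost no video-specific domain knowledge is needed, which is the point of the paper.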

Bugs encountered when training or fine-tuning VideoMAE

  1. Error when training VideoMAE:

```
File "/home/MAE-Action-Detection/run_class_finetuning.py", line 404, in main
    train_stats = train_one_epoch(
File "/home/MAE-Action-Detection/engine_for_finetuning.py", line 59, in train_one_epoch
    for step, (samples, boxes, _) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
File "/home/MAE-Action-Detection/utils.py", line 141, in log_every
    for obj in iterable:
File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
File "/home/anaconda3/envs/VideoMAE/lib/python3.9/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
File "av/error.pyx", line 78, in av.error.FFmpegError.__init__
TypeError: __init__() takes at least 3 positional arguments (2 given)
```

The fix: upgrade torch, torchvision, and the related packages to the matching 1.13 versions (they were 1.9 before).

  2. An issue with the AVA dataset format

Follow the AlphAction format. In the boxs folder, the bbox format in ava_det.json defaults to x1, y1, w, h rather than x1, y1, x2, y2, so there is a spot later in AVADataset that does Box(mode="xyxy").convert("xywh"). If the bboxes are in x1, y1, x2, y2 format, the convert("xywh") is not needed; if they are in the default x1, y1, w, h format, the convert("xywh") is needed. This is a big pitfall, and the authors do not seem to mention it anywhere. (A small conversion sketch follows below.)
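To make the two conventions explicit, here is a minimal sketch in plain Python (not the AlphAction BoxList API) of converting between them:

```python
def xywh_to_xyxy(box):
    """Convert [x1, y1, w, h] (the default ava_det.json convention) to [x1, y1, x2, y2]."""
    x1, y1, w, h = box
    return [x1, y1, x1 + w, y1 + h]

def xyxy_to_xywh(box):
    """Convert [x1, y1, x2, y2] back to [x1, y1, w, h]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

# Example: a detection stored in the default x1, y1, w, h convention.
det = [10.0, 20.0, 50.0, 80.0]
print(xywh_to_xyxy(det))   # [10.0, 20.0, 60.0, 100.0]
```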

SlowFast Networks for Video Recognition[1]

The authors are Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He from FAIR. Citation [1]: Feichtenhofer, Christoph et al. “SlowFast Networks for Video Recognition.” 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2018): 6201-6210.

Time

  • 2018.Dec

Key Words

  • Slow pathway to capture spatial semantics
  • lightweight Fast pathway to capture temporal motion and fine temporal resolution

Motivation

  1. All spatiotemporal orientations are not equally likely, so there is no reason to treat space and time symmetrically.
  2. Inspired by biological studies on retinal ganglion cells in the primate visual system: Parvocellular cells (P-cells) account for about 80% and Magnocellular cells (M-cells) for about 20%, where
    • M-cells operate at high temporal frequency \(\rightarrow\) they respond to fast temporal changes;
    • P-cells detect spatial information: spatial detail and color, at lower temporal resolution (a minimal two-pathway sampling sketch follows this list).
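A minimal sketch of the two-pathway frame sampling, assuming a 64-frame raw clip, temporal stride τ = 16 for the Slow pathway, and a speed ratio α = 8 for the Fast pathway (typical settings; the actual model adds lightweight Fast-pathway channels and lateral connections on top of this):

```python
import torch

def slowfast_inputs(clip, alpha=8, tau=16):
    """Sample the two pathways from a raw clip of shape (C, T, H, W).

    Slow pathway: low frame rate (stride tau), capturing spatial semantics.
    Fast pathway: alpha x higher frame rate, capturing fast motion at fine temporal resolution.
    """
    slow = clip[:, ::tau]               # e.g. 4 frames from a 64-frame clip
    fast = clip[:, ::tau // alpha]      # e.g. 32 frames from the same clip
    return slow, fast

clip = torch.randn(3, 64, 224, 224)     # 64 RGB frames
slow, fast = slowfast_inputs(clip)
print(slow.shape, fast.shape)           # torch.Size([3, 4, 224, 224]) torch.Size([3, 32, 224, 224])
```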