Usage and meaning of some function parameters

*args and **kwargs are mainly used in function definitions; they let you pass a variable number of arguments to a function. "Variable" here means that you do not know in advance how many arguments the caller will pass, which is exactly the scenario these two constructs are for. *args is used to pass a variable-length list of non-keyword (positional) arguments to a function.

**kwargs

  1. **kwargs allows a variable number of key-value pairs (keyword arguments) to be passed to a function. If you need to handle named arguments inside a function, use **kwargs.

*args

  1. *args is used to pass a variable-length list of non-keyword (positional) arguments to a function.

The order of standard arguments, *args, and **kwargs:

some_func(fargs, *args, **kwargs)
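A minimal runnable sketch of both forms (the function and argument names below are made up for illustration):

```python
def order_pizza(size, *toppings, **details):
    """size is a standard argument, *toppings collects any extra positional
    arguments, and **details collects any extra keyword arguments."""
    print("size:", size)
    print("toppings:", toppings)   # a tuple of the extra positional arguments
    print("details:", details)     # a dict of the extra keyword arguments

# Extra positional arguments go into *toppings, named ones into **details.
order_pizza("large", "mushroom", "olive", delivery=True, tip=5)
# size: large
# toppings: ('mushroom', 'olive')
# details: {'delivery': True, 'tip': 5}
```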

Read more »

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles[1]

The authors are Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer, from Meta, Georgia Tech, and Johns Hopkins. Citation [1]: Ryali, Chaitanya K. et al. “Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles.” ArXiv abs/2306.00989 (2023): n. pag.

Time

  • 2023.Jun

Key Words

  • visual pretext task: MAE
  • hierarchical(multiscale) vision transformer
  • Mask unit attention vs Window attention
  • add spatial bias by teaching it to the model with a strong pretext task like MAE, instead of via vision-specific modules like shifted windows or convolutions.
  • One-sentence summary: pretrain MViTv2 (as the encoder) with MAE instead of a vanilla ViT, strip out several of MViTv2's design components, and use mask unit attention; this achieves very good results. A sketch of the mask-unit-attention idea follows this list.
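A minimal sketch of the mask unit attention idea, for illustration only: attention is computed independently inside each local group of tokens (a "mask unit") rather than over all tokens, which is what keeps it cheap. The real Hiera block uses learned QKV projections and pooling that are omitted here.

```python
import torch

def mask_unit_attention(x, num_units, num_heads=4):
    """Self-attention restricted to local groups ("mask units") of tokens.

    x: (batch, tokens, dim); tokens must split evenly into num_units groups,
    and dim must be divisible by num_heads.
    """
    B, N, D = x.shape
    tokens_per_unit = N // num_units
    head_dim = D // num_heads
    # Group tokens by mask unit so attention never crosses unit boundaries.
    u = x.reshape(B * num_units, tokens_per_unit, D)
    # For brevity q = k = v = u; a real block would use learned projections.
    qkv = u.reshape(B * num_units, tokens_per_unit, num_heads, head_dim).transpose(1, 2)
    attn = (qkv @ qkv.transpose(-2, -1)) / head_dim ** 0.5
    out = (attn.softmax(dim=-1) @ qkv).transpose(1, 2).reshape(B, N, D)
    return out

# Example: 2 images, 64 tokens grouped into 4 mask units of 16 tokens each.
y = mask_unit_attention(torch.randn(2, 64, 32), num_units=4)
print(y.shape)  # torch.Size([2, 64, 32])
```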

Motivation

  1. Many recent hierarchical ViTs add vision-specific components in pursuit of supervised classification performance. While these components yield good accuracy and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. The authors argue that this extra bulk is unnecessary: by pretraining with a strong visual pretext task (MAE), all the bells and whistles can be removed without hurting accuracy. Hence the authors propose Hiera.
Read more »

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning[1]

The authors are Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, and Yu-Gang Jiang, from Fudan University and the Microsoft Cloud + AI team. Citation [1]: Wang, Rui et al. “Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning.” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 6312-6322.

Time

  • 2022.Dec

Key Words

  • Masked Video Modeling/Masked Image Modeling
  • High-level features from video teacher and image teacher for continued masked feature prediction
  • spatial-temporal co-teaching
  • In short: use image and video models pretrained with MIM/MVM as teachers, whose features serve as masked feature prediction targets for a student, thereby learning video representations. A minimal sketch of this objective follows this list.
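A minimal sketch of the masked feature prediction objective, assuming the student's predictions are already aligned token-by-token with frozen teacher features; the paper's exact loss, normalization, and teacher weighting may differ.

```python
import torch
import torch.nn.functional as F

def masked_feature_distillation_loss(student_pred, teacher_feat, mask):
    """Regress the teacher's high-level features at the masked positions only.

    student_pred, teacher_feat: (batch, tokens, dim)
    mask: (batch, tokens) bool tensor, True where a token was masked out.
    """
    pred = student_pred[mask]              # student predictions for masked tokens
    target = teacher_feat[mask].detach()   # frozen teacher provides the targets
    return F.smooth_l1_loss(pred, target)

# Spatial-temporal co-teaching (illustrative): combine the losses against an
# MIM-pretrained image teacher and an MVM-pretrained video teacher, e.g.
#   loss = masked_feature_distillation_loss(pred_i, image_teacher_feats, mask) \
#        + masked_feature_distillation_loss(pred_v, video_teacher_feats, mask)
```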

Motivation

  • For self-supervised visual representation learning, recent MIM methods such as MAE, BEiT, and PeCo achieve strong performance with vision transformers. This pretraining paradigm has been carried over to the video domain and brings significant gains to video transformers; representative masked video modeling (MVM) works include BEVT, VideoMAE, and ST-MAE. Following MAE and BEiT, existing masked video modeling methods pretrain video transformers by reconstructing low-level features such as raw pixel values or low-level VQ-VAE tokens. However, low-level reconstruction targets are usually noisy, and because video data is highly redundant, MVM easily learns shortcuts, which limits transfer performance on downstream tasks. To alleviate this, MVM methods typically use larger masking ratios.
Read more »

Denoising Diffusion Autoencoders are Unified Self-supervised Learners[1]

The authors are Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang, from Beihang University. Citation [1]: Xiang, Weilai et al. “Denoising Diffusion Autoencoders are Unified Self-supervised Learners.” 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023): 15756-15766.

Time

  • 2023.Mar

Key Words

  • generative (translation,...) and discriminative (classification, recognition) tasks
  • generative pre-training and denoising autoencoding
  • DDAE as generative models and competitive recognition models
  • extend generative models for discriminative purposes
  • linearly separable features learned in an unsupervised manner
  • latent space vs. pixel space
Read more »

Auto-Encoding Variational Bayes[1]

The two authors are Diederik P. Kingma and Max Welling, from the Machine Learning Group at the Universiteit van Amsterdam. Citation: Kingma, Diederik P. and Max Welling. “Auto-Encoding Variational Bayes.” CoRR abs/1312.6114 (2013): n. pag.

Time

  • 2013.Dec

Key Words

  • reparameterization of variational lower bound
  • lower bound estimator
  • continuous latent variable with intractable posterior
  • i.i.d. dataset with latent variables per datapoint

Problem addressed

  1. How can we perform efficient approximate inference and learning with directed probabilistic models whose continuous latent variables or parameters have intractable posterior distributions?
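The paper's answer is a reparameterized estimator of the variational lower bound (ELBO). Below is a minimal PyTorch sketch for a diagonal-Gaussian posterior; `encoder` and `decoder` are placeholder modules, and a Gaussian (squared-error) likelihood is assumed for the reconstruction term.

```python
import torch

def reparameterized_elbo(x, encoder, decoder):
    # The encoder outputs the mean and log-variance of the approximate posterior q(z|x).
    mu, logvar = encoder(x)
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # which keeps the sampling step differentiable w.r.t. mu and logvar.
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps
    # Reconstruction term log p(x|z), here a Gaussian / squared-error stand-in.
    recon = -((decoder(z) - x) ** 2).sum(dim=-1)
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
    # ELBO = E_q[log p(x|z)] - KL; training maximizes it (minimizes its negative).
    return (recon - kl).mean()
```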
Read more »

Extracting and Composing Robust Features with Denoising Autoencoders[1]

This paper was published in 2008 by a team from the Université de Montréal; the authors are Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Citation: Vincent, Pascal et al. “Extracting and composing robust features with denoising autoencoders.” International Conference on Machine Learning (2008).

Time

  • 2008.Feb

Key Words

Summary

  1. The difficulty of learning deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps inputs to useful intermediate representations. The authors propose a new way of learning such representations in an unsupervised manner, based on making the learned representations robust to partial corruption of the input pattern; a minimal sketch follows this list.

  2. Each layer produces a representation of the input pattern that is more abstract than that of the previous layer, because it is obtained by composing more operations.
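A minimal NumPy sketch of the first point, assuming masking-style corruption, tied weights, and a squared-error reconstruction loss (the paper also uses a cross-entropy loss for binary inputs); all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(x, destruction_rate=0.3):
    # Masking-style corruption: randomly zero a fraction of the input components.
    return x * (rng.random(x.shape) > destruction_rate)

def denoising_loss(x, W, b, b_prime):
    """One forward pass of a tied-weight denoising autoencoder."""
    x_tilde = corrupt(x)               # corrupted input
    y = sigmoid(x_tilde @ W + b)       # hidden representation (the learned features)
    z = sigmoid(y @ W.T + b_prime)     # reconstruction of the *clean* input
    return np.mean((z - x) ** 2)       # squared error against the uncorrupted x

# Usage with random data and parameters, just to show the shapes.
x = rng.random((8, 20))                      # batch of 8 inputs, 20 dims each
W = rng.normal(scale=0.1, size=(20, 10))     # 10 hidden units, tied weights
print(denoising_loss(x, W, np.zeros(10), np.zeros(20)))
```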

    Read more »

Emerging Properties in Self-Supervised Vision Transformers[1]

The authors are a team from FAIR, Inria, and Sorbonne University. Citation [1]: Caron, Mathilde et al. “Emerging Properties in Self-Supervised Vision Transformers.” 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021): 9630-9640.

Time

  • 2021.Apr

Motivation

  1. Is the success of transformers in vision due to the supervision used during pretraining? A major factor behind the success of transformers in NLP is self-supervised pretraining.
  2. The authors therefore study self-supervised pretraining on ViT features.

Key Words

  • Self-supervised ViT features
  • self-distillation with no labels (DINO)

Summary

  1. Properties that emerge in self-supervised ViTs but do not appear in supervised ViTs:
    • The features explicitly contain the scene layout and, in particular, object boundaries; this information is found mainly in the self-attention modules of the last block.
    • A self-supervised ViT reaches 78.3% top-1 accuracy on ImageNet with a basic k-NN classifier, without any finetuning, linear classifier, or data augmentation.
  2. This strong k-NN performance is only achieved when combined with a momentum encoder and multi-crop augmentation. Using smaller patches with ViTs improves the quality of the resulting features. A minimal sketch of the DINO self-distillation loss follows.
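A minimal sketch of the DINO-style self-distillation objective (temperature and momentum values are illustrative): the student is trained to match the teacher's centered, sharpened output distribution, while the teacher and the center are updated with exponential moving averages rather than by gradients.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher's and the student's distributions."""
    # Teacher: centered and sharpened; no gradient flows through it.
    t = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    # Student: log-probabilities at a higher temperature.
    log_s = F.log_softmax(student_out / student_temp, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

# The teacher's weights and the center are updated by exponential moving
# averages of the student / batch statistics, e.g.
#   teacher_param.data = m * teacher_param.data + (1 - m) * student_param.data
#   center = c_m * center + (1 - c_m) * teacher_out.mean(dim=0)
```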
Read more »
