An Image is Worth \(16 \times 16\) Words: Transformers for Image Recognition at Scale[1]

The authors, all from Google Research, Brain Team, are Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. Citation[1]: Dosovitskiy, Alexey et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” ArXiv abs/2010.11929 (2020): n. pag.

Time

  • 2020.Oct

Key Words

  • Vision Transformer
  • Image patches (in vision) \(\Leftrightarrow\) tokens (words) in NLP (see the patch-embedding sketch below)
  • large-scale training
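To make the patch-token analogy concrete, here is a minimal PyTorch sketch of the patch-embedding step (splitting a 224x224 image into 16x16 patches and linearly projecting each patch), using the common ViT-Base sizes; it is an illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn an image into a sequence of patch embeddings ("visual words")."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # a stride-16 conv over 16x16 patches is equivalent to flattening each
        # patch and applying a shared linear projection
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                         # x: (B, 3, 224, 224)
        x = self.proj(x)                          # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)       # (B, 196, 768): 196 patch tokens

print(PatchEmbed()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 196, 768])
```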

Summary

  1. The dominant approach with self-attention models is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset. Thanks to Transformers' computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters. As models and datasets keep growing, there is still no sign of saturating performance.

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions[1]

The authors are Chunhui Gu, Chen Sun, David A. Ross, and others, from Google Research, Inria Laboratoire Jean Kuntzmann (Grenoble, France), and UC Berkeley. Citation[1]: Gu, Chunhui et al. “AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2017): 6047-6056.

Time

  • 2017.May

Key Words

  • atomic visual actions rather than composite actions
  • precise spatio-temporal annotations with possibly multiple annotations for each person
  • exhaustive annotation of these atomic actions over 15-minute video clips
  • people temporally linked across consecutive segments

Summary

  1. The dataset is sourced from the 15th-30th minute time intervals of 430 different movies, which, given the 1 Hz sampling frequency, gives nearly 900 keyframes per movie. In each keyframe, every person is labeled with (possibly multiple) actions from the AVA vocabulary. Each person is linked across consecutive keyframes to provide short temporal sequences of action labels.
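The "nearly 900 keyframes" follows directly from the sampling scheme: 15 minutes sampled at 1 Hz is 900 one-second steps. A tiny illustrative sketch (not part of the AVA tooling) that enumerates those keyframe timestamps:

```python
# Illustrative only: enumerate 1 Hz keyframe timestamps (in seconds)
# for the 15th-30th minute interval of a movie.
def keyframe_timestamps(start_min=15, end_min=30):
    return list(range(start_min * 60, end_min * 60))  # one timestamp per second

print(len(keyframe_timestamps()))  # 900 keyframes per movie
```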


Usage and Meaning of Some Parameters

*args and **kwargs are mainly used in function definitions; they let you pass a variable number of arguments to a function. "Variable" here means that you do not know in advance how many arguments callers will pass, which is exactly the scenario these two constructs are for. *args is used to pass a variable-length list of non-keyword (positional) arguments to a function.

**kwargs

  1. **kwargs lets you pass a variable number of keyword (key-value) arguments to a function. If you want to handle named arguments inside a function, use **kwargs.
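A minimal example (the function name greet is just for illustration):

```python
def greet(**kwargs):
    # kwargs is a dict of all keyword arguments passed to the function
    for name, value in kwargs.items():
        print(f"{name} = {value}")

greet(lang="python", version=3)
# lang = python
# version = 3
```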

*args

  1. *args is used to pass a variable-length list of non-keyword (positional) arguments to a function.
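Again a minimal example (total is an illustrative name):

```python
def total(*args):
    # args is a tuple holding all positional arguments
    return sum(args)

print(total(1, 2, 3))  # 6
```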

Order of standard parameters, *args, and **kwargs in a function signature:

some_func(fargs, *args, **kwargs)
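Putting the three together, a small sketch showing how the arguments are bound (some_func is a toy function, not from any library):

```python
def some_func(fargs, *args, **kwargs):
    # fargs is an ordinary positional parameter; *args collects the remaining
    # positional arguments into a tuple; **kwargs collects keyword arguments into a dict
    print("fargs :", fargs)
    print("args  :", args)
    print("kwargs:", kwargs)

some_func(1, 2, 3, a="x", b="y")
# fargs : 1
# args  : (2, 3)
# kwargs: {'a': 'x', 'b': 'y'}
```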


Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles[1]

The authors are Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer, from Meta, Georgia Tech, and Johns Hopkins. Citation[1]: Ryali, Chaitanya K. et al. “Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles.” ArXiv abs/2306.00989 (2023): n. pag.

Time

  • 2023.Jun

Key Words

  • visual pretext task: MAE
  • hierarchical (multiscale) vision transformer
  • Mask unit attention vs window attention (sketched below)
  • add spatial bias by teaching the model with a strong pretext task like MAE, instead of with vision-specific modules like shifted windows or convolutions
  • One-sentence summary: pre-train an MViTv2-style hierarchical encoder with MAE instead of a vanilla ViT, strip out some of MViTv2's design components, and use mask unit attention; this achieves strong results.
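To make "mask unit attention" concrete, here is a heavily simplified PyTorch sketch: attention is restricted to fixed-size groups of tokens ("mask units"), using the stock nn.MultiheadAttention and a 1-D grouping rather than Hiera's actual fused, 2-D spatial implementation.

```python
import torch
import torch.nn as nn

class MaskUnitAttention(nn.Module):
    """Self-attention computed independently within non-overlapping token groups.
    A simplified sketch of the idea, not the Hiera implementation."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, unit_size):
        # x: (B, N, D) tokens; N is assumed divisible by unit_size
        B, N, D = x.shape
        units = N // unit_size
        x = x.reshape(B * units, unit_size, D)   # group tokens into mask units
        out, _ = self.attn(x, x, x)              # attend only within each unit
        return out.reshape(B, N, D)

# toy usage: 8 mask units of 16 tokens each
tokens = torch.randn(2, 128, 96)
print(MaskUnitAttention(dim=96)(tokens, unit_size=16).shape)  # torch.Size([2, 128, 96])
```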

Motivation

  1. Many recent hierarchical ViTs add vision-specific components in pursuit of supervised classification accuracy. While these components yield good accuracy and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. The authors argue that this extra bulk is unnecessary: by pre-training with a strong visual pretext task (MAE), the bells and whistles can be removed without hurting accuracy. This is why they propose Hiera.

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning[1]

The authors are Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, and Yu-Gang Jiang, from Fudan University and the Microsoft Cloud + AI team. Citation[1]: Wang, Rui et al. “Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning.” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 6312-6322.

Time

  • 2022.Dec

Key Words

Motivation


Denoising Diffusion Autoencoders are Unified Self-supervised Learners[1]

The authors are Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang, from Beihang University. Citation[1]: Xiang, Weilai et al. “Denoising Diffusion Autoencoders are Unified Self-supervised Learners.” 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023): 15756-15766.

Time

  • 2023.Mar

Key Words

  • generative (translation,...) and discriminative (classification, recognition) tasks
  • generative pre-training and denoising autoencoding
  • DDAE as generative models and competitive recognition models
  • extend generative models for discriminative purposes
  • linearly-separable features learned in an unsupervised manner (see the probe sketch below)
  • latent space vs. pixel space
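Linear separability of unsupervised features is usually measured with a linear probe on frozen features. The sketch below assumes the features have already been extracted (e.g., pooled activations from an intermediate layer of the pretrained denoising network at some noise timestep); it is illustrative and not the paper's evaluation code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on frozen features and report test accuracy,
    as a proxy for how linearly separable the features are."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)

# toy usage with random features (replace with real DDAE features and labels)
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 64))
labels = rng.integers(0, 10, size=200)
print(linear_probe(feats[:150], labels[:150], feats[150:], labels[150:]))
```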

Auto-Encoding Variational Bayes[1]

The two authors are Diederik P. Kingma and Max Welling, from the Machine Learning Group, Universiteit van Amsterdam. Citation[1]: Kingma, Diederik P. and Max Welling. “Auto-Encoding Variational Bayes.” CoRR abs/1312.6114 (2013): n. pag.

Time

  • 2013.Dec

Key Words

  • reparameterization of variational lower bound
  • lower bound estimator
  • continuous latent variable with intractable posterior
  • i.i.d. dataset with latent variables per datapoint

Problem Addressed

  1. How can we perform efficient approximate inference and learning with directed probabilistic models whose continuous latent variables or parameters have intractable posterior distributions?
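The key words above ("lower bound estimator", "reparameterization of the variational lower bound") can be made concrete with the standard form of the bound and the reparameterization trick; the notation below is the usual VAE formulation rather than a quote of the paper's exact equations.

\[
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p_\theta(z)\right),
\qquad
z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,\;\; \epsilon \sim \mathcal{N}(0, I).
\]

Sampling \(z\) through \(\epsilon\) moves the randomness outside \(\phi\), so a Monte Carlo estimate of the bound is differentiable with respect to both \(\theta\) and \(\phi\) and can be optimized with ordinary gradient methods.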