ViT
An Image is Worth \(16 \times 16\) Words: Transformers for Image Recognition at Scale[1]
There are many authors, all from Google Research, Brain Team: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. Citation [1]: Dosovitskiy, Alexey et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” ArXiv abs/2010.11929 (2020): n. pag.
Time
- 2020.Oct
Key Words
- Vision Transformer
- Image patches (in Vision) \(\Leftrightarrow\) tokens (words) in NLP
- larger scale training
Summary
- The dominant approach with self-attention is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset. Thanks to Transformers' computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters. As models and datasets keep growing, there is still no sign of saturating performance.
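To make the "image patches as tokens" idea from the key words concrete, below is a minimal sketch of a ViT-style patch embedding, assuming PyTorch; the strided-Conv2d projection, the \(224 \times 224\) input size, and the parameter names are common illustrative choices, not the paper's exact code.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping 16x16 patches and project each to a D-dim token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening each patch and applying a shared linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, D) -- one token per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend a [CLS] token, as in BERT
        return x + self.pos_embed              # add learned position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))  # (2, 197, 768), ready for a Transformer encoder
```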
AVA Dataset
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions[1]
The authors are Chunhui Gu, Chen Sun, David A. Ross, and others from Google Research, Inria Laboratoire Jean Kuntzmann (Grenoble, France), and UC Berkeley. Citation [1]: Gu, Chunhui et al. “AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2017): 6047-6056.
Time
- 2017.May
Key Words
- atomic visual actions rather than composite actions
- precise spatio-temporal annotations with possibly multiple annotations for each person
- exhaustive annotation of these atomic actions over 15-minute video clips
- people temporally linked across consecutive segments
Summary
- The dataset is sourced from the 15th to 30th minute of 430 different movies; at a 1 Hz sampling frequency this gives nearly 900 keyframes per movie. In each keyframe, every person is labeled with (possibly multiple) actions from the AVA vocabulary. Each person is linked across consecutive keyframes to provide short temporal sequences of action labels.
KalmanFiltering
Kalman Filtering
- The Kalman filter is one of the most commonly used and most important state-estimation algorithms. It estimates hidden states from uncertain, imprecise measurements and can also predict future system states from past estimates. The filter is named after Rudolf E. Kalman, who in 1960 published his famous paper describing a recursive solution to the discrete-data linear filtering problem. Today it is widely used in target tracking, localization and navigation systems, control systems, and other fields.
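As a concrete illustration, here is a minimal sketch of a 1D Kalman filter. The constant-state model and the noise variances `q` and `r` are assumptions chosen for simplicity, not tied to any particular application.

```python
def kalman_1d(measurements, x0=0.0, p0=1.0, q=1e-4, r=0.1):
    """Estimate a scalar hidden state from noisy measurements.

    x0, p0 : initial state estimate and its variance
    q, r   : process-noise and measurement-noise variances
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: with a constant-state model, only the uncertainty grows.
        p = p + q
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return estimates

# Example: noisy readings of a true value around 1.0
print(kalman_1d([1.1, 0.9, 1.05, 0.98, 1.2]))
```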
Python Notes
Usage and meaning of some function parameters
*args and **kwargs are mainly used in function definitions to pass a variable number of arguments to a function. "Variable" here means that you do not know in advance how many arguments callers will pass, which is exactly when these two forms are used. *args passes a variable-length list of non-keyword (positional) arguments to a function.
**kwargs
- **kwargs allows you to pass a variable number of keyword (key-value) arguments to a function; use **kwargs when the function needs to handle named arguments.
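A minimal example (the function name `greet_me` is just illustrative):

```python
def greet_me(**kwargs):
    # kwargs is a dict of all keyword arguments passed in.
    for key, value in kwargs.items():
        print(f"{key} = {value}")

greet_me(name="yasoob", city="Karachi")
# name = yasoob
# city = Karachi
```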
*args
- *args passes a variable-length list of non-keyword (positional) arguments to a function.
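A minimal example (`test_var_args` is an illustrative name):

```python
def test_var_args(first, *args):
    # args is a tuple holding all extra positional arguments.
    print("first normal arg:", first)
    for arg in args:
        print("another arg through *args:", arg)

test_var_args("yasoob", "python", "eggs", "test")
```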
Order of standard arguments, *args, and **kwargs:
some_func(fargs, *args, **kwargs)
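A small sketch showing this ordering in practice (`some_func` here is a toy definition, not a real API):

```python
def some_func(fargs, *args, **kwargs):
    print("formal arg:", fargs)
    print("extra positional args:", args)
    print("keyword args:", kwargs)

some_func(1, 2, 3, a="x", b="y")
# formal arg: 1
# extra positional args: (2, 3)
# keyword args: {'a': 'x', 'b': 'y'}
```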
HieraViT
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles[1]
The authors are Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer, from Meta, Georgia Tech, and Johns Hopkins. Citation [1]: Ryali, Chaitanya K. et al. “Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles.” ArXiv abs/2306.00989 (2023): n. pag.
Time
- 2023.Jun
Key Words
- visual pretext task: MAE
- hierarchical (multiscale) vision transformer
- Mask unit attention vs Window attention
- add spatial bias by teaching the model with a strong pretext task like MAE, instead of relying on vision-specific modules such as shifted windows or convolutions.
- One-sentence summary: pre-train MViTv2 (the encoder) with MAE rather than a vanilla ViT, remove some of MViTv2's design components, and use mask unit attention; this yields strong results (a minimal sketch of mask-unit-style local attention is given after the motivation below).
Motivation
- Many recent hierarchical ViTs add vision-specific components in pursuit of supervised classification performance. While these components deliver good accuracy and attractive FLOP counts, the added complexity makes these transformers slower than their vanilla ViT counterparts. The authors argue that this extra bulk is unnecessary: by pre-training with a strong visual pretext task (MAE), the bells and whistles can be removed without losing accuracy. Hence they propose Hiera.
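The sketch below illustrates the "mask unit attention" idea from the key words: self-attention restricted to fixed groups of tokens (mask units). The shapes, the group size `unit`, and the single-head formulation are simplifications for illustration, not Hiera's actual implementation.

```python
import torch
import torch.nn as nn

def mask_unit_attention(x, qkv, unit=64):
    """Single-head self-attention computed independently within each group ("mask unit") of tokens.

    x   : (B, N, D) token embeddings, with N divisible by `unit`
    qkv : a linear layer mapping D -> 3*D
    """
    B, N, D = x.shape
    x = x.view(B, N // unit, unit, D)            # group tokens into mask units
    q, k, v = qkv(x).chunk(3, dim=-1)            # each (B, N/unit, unit, D)
    attn = (q @ k.transpose(-2, -1)) / D ** 0.5  # attention is confined to each unit
    out = attn.softmax(dim=-1) @ v
    return out.view(B, N, D)

x = torch.randn(2, 256, 96)
qkv = nn.Linear(96, 3 * 96)
print(mask_unit_attention(x, qkv).shape)         # torch.Size([2, 256, 96])
```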
MVD
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning[1]
The authors are Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, and Yu-Gang Jiang, from Fudan University and the Microsoft Cloud + AI team. Citation [1]: Wang, Rui et al. “Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning.” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 6312-6322.
Time
- 2022.Dec
Key Words
- Masked Video Modeling/Masked Image Modeling
- High-level features from video teacher and image teacher for continued masked feature prediction
- spatial-temporal co-teaching
- Put simply: image and video models pre-trained with MIM/MVM serve as teachers whose features are used as masked feature prediction targets for the student, enabling video representation learning.
Motivation
- For self-supervised visual representation learning, recent MIM methods such as MAE, BEiT, and PeCo achieve strong performance with vision transformers. This pre-training paradigm has been carried over to the video domain and brings significant improvements to video transformers; representative MVM (masked video modeling) works include BEVT, VideoMAE, and ST-MAE. Following MAE and BEiT, existing masked video modeling methods pre-train video transformers by reconstructing low-level features, such as raw pixel values or low-level VQ-VAE tokens. However, low-level reconstruction targets are usually noisy, and because video data is highly redundant, MVM easily learns shortcuts, which limits transfer performance on downstream tasks. To mitigate this, MVM methods typically use larger masking ratios.
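A minimal sketch of the masked-feature-distillation objective described above, assuming PyTorch. The linear "student" and "teacher" are stand-ins for the real encoder/decoder and the frozen MIM/MVM-pretrained teachers; MVD's actual architecture, loss weighting, and spatial-temporal co-teaching with both an image and a video teacher are more involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_feature_distillation_loss(student, teacher, tokens, mask):
    """Student predicts the frozen teacher's high-level features at masked token positions.

    tokens : (B, N, D) patchified video tokens
    mask   : (B, N) boolean, True where a token is masked for the student
    """
    with torch.no_grad():
        target = teacher(tokens)                    # teacher features, (B, N, D_t)
    # Toy student: sees zeroed-out masked tokens and predicts teacher features everywhere.
    pred = student(tokens * (~mask).unsqueeze(-1))
    # Supervise only the masked positions, as in masked feature modeling.
    return F.mse_loss(pred[mask], target[mask])

B, N, D, D_t = 2, 196, 768, 384
student = nn.Linear(D, D_t)                         # stand-in for the student encoder + decoder
teacher = nn.Linear(D, D_t)                         # stand-in for a frozen MIM/MVM-pretrained teacher
tokens = torch.randn(B, N, D)
mask = torch.rand(B, N) < 0.9                       # high masking ratio, as noted above
print(masked_feature_distillation_loss(student, teacher, tokens, mask))
```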
DEiT
DDAE
Denoising Diffusion Autoencoders are Unified Self-supervised Learners[1]
The authors are Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang from Beihang University. Citation [1]: Xiang, Weilai et al. “Denoising Diffusion Autoencoders are Unified Self-supervised Learners.” 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023): 15756-15766.
Time
- 2023.Mar
Key Words
- generative (translation,...) and discriminative (classification, recognition) tasks
- generative pre-training and denoising autoencoding
- DDAE as generative models and competitive recognition models
- extend generative models for discriminative purposes
- linearly-separable features obtained in an unsupervised manner (see the linear-probe sketch below)
- latent space vs. pixel space
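As a rough illustration of how "linearly-separable features" are usually measured, here is a minimal linear-probing sketch. The random "features", the optimizer settings, and the single-layer classifier are generic assumptions for illustration, not the paper's exact evaluation protocol; in DDAE the features would be intermediate activations of the frozen denoising network.

```python
import torch
import torch.nn as nn

def linear_probe(features, labels, num_classes, epochs=10, lr=1e-2):
    """Fit a single linear layer on frozen features; accuracy reflects linear separability.

    features : (N, D) activations from an intermediate layer of a frozen model
    labels   : (N,) integer class labels
    """
    probe = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(features), labels)
        loss.backward()
        opt.step()
    acc = (probe(features).argmax(dim=1) == labels).float().mean()
    return acc.item()

# Toy usage with random "features" and labels.
feats, labels = torch.randn(512, 256), torch.randint(0, 10, (512,))
print(linear_probe(feats, labels, num_classes=10))
```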