DDAE
Denoising Diffusion Autoencoders are Unified Self-supervised Learners[1]
The authors are Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang from Beihang University. Reference [1]: Xiang, Weilai et al. "Denoising Diffusion Autoencoders are Unified Self-supervised Learners." 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023): 15756-15766.
Time
- 2023.Mar
Key Words
- generative (translation,...) and discriminative (classification, recognition) tasks
- generative pre-training and denoising autoencoding
- DDAE as generative models and competitive recognition models
- extend generative models for discriminative purposes
- linearly-separable features learned in an unsupervised manner
- latent space vs. pixel space
Purpose
Inspired by recent diffusion models, this work investigates whether denoising autoencoders can acquire discriminative representations for classification through generative pre-training, i.e., whether diffusion models can replicate the success of GPT and T5 in becoming unified generative-and-discriminative learners. Observations:
- Generative pre-training supports diffusion as a meaningful discriminative learning method
- Denoising autoencoding has been widely applied to discriminative visual representation learning
- better image generation capability can translate to improved feature quality, suggesting that diffusion is even more capable of representation learning.
The intermediate activations are taken directly from pre-trained DDAEs, requiring no modification to the diffusion framework and remaining compatible with existing models.
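The feature-extraction recipe above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `ddae_features` is a hypothetical stand-in for an intermediate layer of a pre-trained DDAE (in the paper, a UNet block), and the data, dimensions, and noise level are invented for the example. The point is that frozen intermediate activations feed a simple linear probe, with no change to the diffusion network.

```python
import numpy as np

rng = np.random.default_rng(0)

def ddae_features(x, t, weights):
    """Toy stand-in for one intermediate layer of a pre-trained DDAE:
    corrupt the input at noise level t, then apply frozen weights.
    No part of the diffusion framework is modified."""
    x_t = x + t * rng.standard_normal(x.shape)
    return np.tanh(x_t @ weights)

# Toy data: two roughly linearly separable classes in a 16-dim "pixel space".
n, d, h = 200, 16, 32
W = rng.standard_normal((d, h))              # frozen "pre-trained" weights
labels = rng.integers(0, 2, n)
X = rng.standard_normal((n, d)) + 3.0 * labels[:, None]

feats = ddae_features(X, t=0.1, weights=W)   # frozen features, no fine-tuning

# Linear probe: a least-squares linear classifier on the frozen features.
A = np.hstack([feats, np.ones((n, 1))])      # append a bias column
w, *_ = np.linalg.lstsq(A, 2.0 * labels - 1.0, rcond=None)
pred = (A @ w > 0).astype(int)
acc = (pred == labels).mean()
```

If the frozen activations are linearly separable, the probe's accuracy is high even though only the final linear layer is trained; this mirrors the linear-probe protocol used to evaluate DDAE features.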
Diffusion models can be viewed as multi-level denoising autoencoders; two key factors improve denoising pre-training:
- number of noise levels
- range of noise scales.
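The two factors above can be made concrete with a standard forward-diffusion schedule. This sketch uses common DDPM-style defaults (linear betas from 1e-4 to 0.02 over T = 1000 steps) as an assumption; the paper's own schedules may differ. The number of levels is T, and the beta endpoints set the range of noise scales that the single level-conditioned network must denoise.

```python
import numpy as np

# Factor 1: number of noise levels T.
# Factor 2: range of noise scales, set by the beta schedule endpoints.
# (Linear schedule with DDPM-style defaults, assumed for illustration.)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)   # signal retention at each level

def noisy_sample(x0, t, rng):
    """q(x_t | x_0): corrupt x0 to level t. A DDAE is one network trained
    to denoise every level t, conditioned on t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return x_t, eps

x_t, eps = noisy_sample(np.ones(8), t=500, rng=np.random.default_rng(1))
```

Because `alphas_bar` decays monotonically from near 1 to near 0, early levels are almost-clean reconstructions while late levels approach pure noise, so one pre-trained network spans the whole corruption range.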
Summary
- Strong linearly-separable features can be learned through end-to-end diffusion pre-training. The paper shows that a DDAE pre-trained on unconditional image generation learns strongly linearly-separable representations within its intermediate layers, without auxiliary encoders, establishing diffusion pre-training as a general approach for unified generative-and-discriminative learning.
\(Fig. 1^{[1]}\). Denoising Diffusion Autoencoders (DDAE). Top: Diffusion networks are essentially equivalent to level-conditional denoising autoencoders (DAE); the networks are named DDAEs due to this similarity. Bottom: Linear-probe evaluations confirm that DDAEs produce strong representations at certain intermediate layers. Truncating and fine-tuning a DDAE as a vision encoder further leads to superior image classification performance.