DDAE

Denoising Diffusion Autoencoders are Unified Self-supervised Learners[1]

The authors are Weilai Xiang, Hongyu Yang, Di Huang, and Yunhong Wang from Beihang University. Reference [1]: Xiang, Weilai et al. “Denoising Diffusion Autoencoders are Unified Self-supervised Learners.” 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023): 15756-15766.

Time

  • 2023.Mar

Key Words

  • generative (translation,...) and discriminative (classification, recognition) tasks
  • generative pre-training and denoising autoencoding
  • DDAE as generative models and competitive recognition models
  • extend generative models for discriminative purposes
  • linearly separable features learned in an unsupervised manner
  • latent space vs. pixel space

Purpose

  1. Inspired by recent diffusion models, the paper studies whether denoising autoencoders can acquire discriminative representations for classification through generative pre-training, i.e., whether diffusion models can replicate the success of GPTs and T5 in becoming unified generative-and-discriminative learners. Observations:

    • Generative pre-training supports diffusion as a meaningful discriminative learning method
    • Denoising autoencoding has been widely used for discriminative visual representation learning
    • Better image generation capability can translate into improved feature quality, suggesting that diffusion is even more capable of representation learning.
  2. The intermediate activations of pre-trained DDAEs are taken directly, with no modification to the diffusion framework, so the approach stays compatible with existing models (see the feature-extraction sketch after this list).

  3. Diffusion models can be viewed as multi-level denoising autoencoders; two key factors improve denoising pre-training:

    • number of noise levels
    • range of noise scales.
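
To make point 2 concrete, here is a minimal sketch of how intermediate activations might be read out of a pre-trained diffusion U-Net: noise a clean image to a chosen level \(t\) via \(x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\), run the network once, and capture a mid-level feature map with a forward hook. The `unet` model, the probed layer, and the timestep choice are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def extract_ddae_features(unet, layer, x0, t, alphas_cumprod):
    """Noise clean images x0 to level t, run the pre-trained U-Net once,
    and return spatially pooled activations from an intermediate layer.
    `unet(x_t, t)` is assumed to be a standard noise-prediction network."""
    feats = {}

    def hook(_module, _inputs, output):
        feats["h"] = output  # (B, C, H, W) feature map at the probed layer

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        a_bar = alphas_cumprod[t]                                  # \bar{alpha}_t
        eps = torch.randn_like(x0)
        x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps       # forward diffusion q(x_t | x_0)
        unet(x_t, torch.full((x0.size(0),), t, device=x0.device))  # one denoising pass
    handle.remove()
    # Global average pooling gives one fixed-size feature vector per image.
    return feats["h"].mean(dim=(2, 3))
```

In practice, the noise level t and the probed layer would presumably need to be selected (e.g., by linear-probe validation), which is consistent with the paper's emphasis on noise levels and noise scales as key factors.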

Summary

  1. End-to-end diffusion pre-training yields strong linearly separable features. The paper shows that a DDAE pre-trained on unconditional image generation learns strong linearly separable representations within its intermediate layers, without additional encoders, making diffusion pre-training a general approach for unified generative-and-discriminative learning.
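
A minimal linear-probe sketch to accompany this conclusion, assuming features have already been extracted (e.g., with the hypothetical `extract_ddae_features` helper above): a single linear classifier is fit on frozen DDAE features, and its held-out accuracy is read as a measure of linear separability. The scikit-learn pipeline and hyperparameters are illustrative, not the paper's exact protocol.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def linear_probe_accuracy(train_feats, train_y, test_feats, test_y):
    """Fit a linear classifier on frozen (N, C) DDAE features and report
    test accuracy; the DDAE itself is never updated."""
    probe = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=2000),  # multinomial logistic regression = linear probe
    )
    probe.fit(train_feats, train_y)
    return probe.score(test_feats, test_y)
```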

DDAE \(Fig. 1^{[1]}\). Denoising Diffusion Autoencoders (DDAE). Top: Diffusion networks are essentially equivalent to level-conditional denoising autoencoders (DAE); the networks are named DDAEs due to this similarity. Bottom: Linear probe evaluations confirm that DDAE produces strong representations at some intermediate layers. Truncating and fine-tuning DDAE as a vision encoder further leads to superior image classification performance.