DDDM

Deconstructing Denoising Diffusion Models for Self-Supervised Learning[1]

The authors are Xinlei Chen, Zhuang Liu, Saining Xie, and Kaiming He, from FAIR and NYU. Citation: Chen, Xinlei et al. “Deconstructing Denoising Diffusion Models for Self-Supervised Learning.” ArXiv abs/2401.14404 (2024): n. pag.

Key Words

  • Denoising Diffusion Models
  • Denoising Autoencoder
  • low-dimensional latent space

Summary

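  1. Denoising is at the core of current generative models such as DDMs. These models generate very well and appear to learn representations of visual content. Two open questions:
    • Current studies of the representation ability of DDMs use off-the-shelf pre-trained DDMs, which were originally built for generation and are now being evaluated on recognition representations;
    • It is unclear whether the representation ability comes from the denoising-driven process or from the diffusion-driven process.
  2. The idea of the paper is to deconstruct a DDM, gradually turning it into a classical DAE, and to examine each of its components along the way. The main finding is that one component matters most, the tokenizer, which creates a low-dimensional latent space. The role of using multiple levels of noise is analogous to a form of data augmentation.
  1. The architecture is shown in the figure below. It is called "l-DAE", for latent Denoising Autoencoder. With a single noise level (not using the noise scheduling of DDM) it can still achieve a decent result. The authors argue that the representation ability of DDMs is gained through the denoising-driven process rather than the diffusion-driven process.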

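  2. Deconstruction steps:

    • Reorient the generation-focused settings of DiT toward self-supervised learning. Here the authors remove several of the loss terms used in the DDM, remove class-conditioning, and replace the DDM's noise schedule with a linear one. Conclusion: generation quality is not correlated with self-supervised learning ability.
    • Deconstruct and simplify the tokenizer step by step. Several tokenizers are tried: Conv VAE, patch-wise VAE, patch-wise AE, and patch-wise PCA.
    • Push the model toward the classical DAE.
  3. The experiments use no data augmentation, only center crops. This suggests that the representation ability of l-DAE is largely independent of any reliance on data augmentation.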
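
The linear noise schedule mentioned in the deconstruction steps can be sketched as follows. This is a minimal sketch, not the authors' code: it assumes a variance-preserving forward process z_t = gamma_t * z_0 + sigma_t * eps with gamma_t^2 + sigma_t^2 = 1, where gamma_t^2 decays linearly from 1 toward 0. The function names, the default of 1000 steps, and the plain-list latent are hypothetical choices made for illustration.

```python
import math
import random


def linear_schedule(t, num_steps=1000):
    """Return (gamma_t, sigma_t), with gamma_t^2 decaying linearly in t.

    Assumption: gamma_t^2 + sigma_t^2 = 1 (variance-preserving).
    """
    if not 0 <= t < num_steps:
        raise ValueError("t must be in [0, num_steps)")
    gamma_sq = 1.0 - t / num_steps
    return math.sqrt(gamma_sq), math.sqrt(1.0 - gamma_sq)


def add_noise(z0, t, num_steps=1000, rng=random):
    """Corrupt a latent vector z0 (a list of floats) at time step t."""
    gamma, sigma = linear_schedule(t, num_steps)
    return [gamma * z + sigma * rng.gauss(0.0, 1.0) for z in z0]


if __name__ == "__main__":
    # Sampling t at random per example gives the multi-level setting;
    # fixing t to one value gives a single-noise-level variant.
    t = random.randrange(1000)
    print(t, add_noise([0.5, -1.2, 0.3], t))
```

Under these assumptions, t = 0 leaves the latent untouched, and larger t moves it toward pure Gaussian noise.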

Architecture

Figure 1: The latent Denoising Autoencoder (l-DAE) architecture we have ultimately reached, after a thorough exploration of deconstructing Denoising Diffusion Models (DDM) [23], with the goal of approaching the classical Denoising Autoencoder (DAE) [39] as much as possible. Here, the clean image (left) is projected onto a latent space using patch-wise PCA, in which noise is added (middle). It is then projected back to pixels via inverse PCA. An autoencoder is learned to predict a denoised image (right). This simple architecture largely resembles classical DAE (with the main difference that noise is added to the latent) and achieves competitive self-supervised learning performance.
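
The corruption path in the caption (patch-wise PCA, noise added in the latent, inverse PCA back to pixels) can be sketched with NumPy. This is a rough illustration, not the paper's implementation: the patch size `p`, latent dimension `d`, noise level `sigma`, and all function names are assumptions, and the autoencoder that predicts the denoised image is omitted.

```python
import numpy as np


def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches: (N, p*p*C)."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by p"
    x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)


def unpatchify(patches, h, w, c, p):
    """Inverse of patchify: (N, p*p*C) back to (H, W, C)."""
    x = patches.reshape(h // p, w // p, p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h, w, c)


def fit_pca(patches, d):
    """Return (mean, basis); basis is (D, d) with orthonormal columns."""
    mean = patches.mean(axis=0)
    cov = np.cov(patches - mean, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :d]        # keep the top-d components
    return mean, basis


def noisy_input(img, mean, basis, p, sigma, rng):
    """Project patches to the PCA latent, add Gaussian noise, project back."""
    h, w, c = img.shape
    latent = (patchify(img, p) - mean) @ basis
    latent = latent + sigma * rng.standard_normal(latent.shape)
    return unpatchify(latent @ basis.T + mean, h, w, c, p)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 32, 3))          # stand-in for a real image
    mean, basis = fit_pca(patchify(img, 4), d=16)
    corrupted = noisy_input(img, mean, basis, p=4, sigma=0.5, rng=rng)
    print(corrupted.shape)
```

In this sketch the PCA is fit on the patches of a single image for brevity; in practice the basis would be fit on patches drawn from a training set. A denoising autoencoder would then take `corrupted` as input and be trained to predict the clean `img`.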

Related Work

Figure 2: A classical DAE and a modern DDM. (a) A classical DAE that adds and predicts noise on the image space. (b) State-of-the-art DDMs (e.g., LDM [33], DiT [32]) that operate on a latent space, where the noise is added and predicted.