DDDM
Deconstructing Denoising Diffusion Models for Self-supervised Learning[1]
The authors are Xinlei Chen, Zhuang Liu, Saining Xie, and Kaiming He, from FAIR and NYU. Citation: Chen, Xinlei et al. “Deconstructing Denoising Diffusion Models for Self-Supervised Learning.” ArXiv abs/2401.14404 (2024): n. pag.
Key Words
- Denoising Diffusion Models
- Denoising Autoencoder
- low-dimensional latent space
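Summary
- Denoising is at the core of current generative models such as DDMs. These models generate very well and appear to learn representations of visual content. Two questions remain open:
- Current studies of the representation ability of DDMs use off-the-shelf pre-trained DDMs, which were originally built for generation and are now being evaluated on recognition;
- It is unclear whether the representation ability comes from the denoising-driven process or from the diffusion-driven process.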
 
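- The approach of the paper: deconstruct a DDM, turning it step by step into a classical DAE, and examine each of its components along the way. The main finding is that one critical component is the tokenizer, which creates a low-dimensional latent space. The role of using multiple levels of noise is analogous to a form of data augmentation.
- The architecture is shown in the figure below. It is called "l-DAE", for latent Denoising Autoencoder. With a single noise level (not using the noise scheduling of DDM) it still achieves a decent result. The authors argue that the representation ability of DDMs is gained through the denoising-driven process rather than the diffusion-driven process.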
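- Deconstruction steps:
- Reorient the generation-focused settings in DiT toward self-supervised learning. Here the authors remove several of the losses used in the DDM, remove class-conditioning, and replace the DDM noise schedule with a linear noise schedule. Conclusion: generation quality and self-supervised learning ability are not correlated.
- Deconstruct and simplify the tokenizer step by step. Several tokenizers are tried: Conv VAE, patch-wise VAE, patch-wise AE, and patch-wise PCA.
- Push the model toward a classical DAE.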
 
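- In the experiments, no data augmentation is applied, only center crops. This suggests that the representation ability of l-DAE is largely independent of any reliance on data augmentation.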
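The contrast between a DDM-style noise schedule and the single noise level mentioned in the summary can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the variance-preserving form z_t = γ_t z_0 + σ_t ε with γ_t² + σ_t² = 1, the linear decay of γ_t², the value `sigma=0.5`, and all function names are hypothetical choices made for the example.

```python
import numpy as np

def linear_gamma_sq(t, num_steps):
    """Assumed linear schedule: gamma_t^2 decays from 1 (t=0) to 0 (t=num_steps)."""
    return 1.0 - t / num_steps

def add_noise_scheduled(z0, t, num_steps, rng):
    """Variance-preserving noising: z_t = gamma_t * z0 + sigma_t * eps."""
    gamma_sq = linear_gamma_sq(t, num_steps)
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(gamma_sq) * z0 + np.sqrt(1.0 - gamma_sq) * eps

def add_noise_single_level(z0, sigma, rng):
    """Classical-DAE-style corruption with one fixed noise level."""
    return z0 + sigma * rng.standard_normal(z0.shape)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 16))      # stand-in for a batch of latents

# DDM style: a random time step is drawn for every training sample/batch.
t = int(rng.integers(0, 1000))
z_sched = add_noise_scheduled(z0, t, num_steps=1000, rng=rng)

# Single-level style: the same sigma is used for every sample.
z_single = add_noise_single_level(z0, sigma=0.5, rng=rng)
```

In the scheduled variant the network sees many corruption strengths during training, which is the sense in which multiple noise levels act like a form of data augmentation; the single-level variant removes that variation.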
Figure 1: The latent Denoising Autoencoder (l-DAE) architecture we have ultimately reached, after a thorough exploration of deconstructing Denoising Diffusion Models (DDM) [23], with the goal of approaching the classical Denoising Autoencoder (DAE) [39] as much as possible. Here, the clean image (left) is projected onto a latent space using patch-wise PCA, in which noise is added (middle). It is then projected back to pixels via inverse PCA. An autoencoder is learned to predict a denoised image (right). This simple architecture largely resembles classical DAE (with the main difference that noise is added to the latent) and achieves competitive self-supervised learning performance.
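The corruption path described in the Figure 1 caption (patch-wise PCA, noise in the latent, inverse PCA) can be sketched as follows. This is a minimal NumPy illustration under assumptions, not the paper's implementation: the patch size, latent dimension `d`, `sigma`, and all helper names are hypothetical, the PCA basis is fitted on a single random image purely for demonstration, and the autoencoder that maps the noisy image back to the clean one is omitted.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches: (N, p*p*C)."""
    h, w, c = img.shape
    x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)

def unpatchify(patches, h, w, c, p):
    """Inverse of patchify: (N, p*p*C) back to (H, W, C)."""
    x = patches.reshape(h // p, w // p, p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h, w, c)

def fit_pca(patches, d):
    """Return the patch mean and the top-d principal directions, shape (d, D)."""
    mean = patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:d]

def corrupt(img, mean, basis, p, sigma, rng):
    """Project patches to the PCA latent, add Gaussian noise, project back to pixels."""
    h, w, c = img.shape
    latent = (patchify(img, p) - mean) @ basis.T           # (N, d)
    latent = latent + sigma * rng.standard_normal(latent.shape)
    return unpatchify(latent @ basis + mean, h, w, c, p)   # inverse PCA

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))                  # stand-in for a clean image
mean, basis = fit_pca(patchify(img, 4), d=16)  # 48-dim patches -> 16-dim latent
noisy = corrupt(img, mean, basis, p=4, sigma=0.5, rng=rng)
# (noisy, img) would form one input/target pair for the denoising autoencoder.
```

Because `d` is smaller than the patch dimension, the projection itself discards information even when `sigma` is zero, which is what makes the latent space low-dimensional.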
Figure 2: A classical DAE and a modern DDM. (a) A classical DAE that adds and predicts noise on the image space. (b) State-of-the-art DDMs (e.g., LDM [33], DiT [32]) that operate on a latent space, where the noise is added and predicted.