DDDM

Posted on 2024-03-27 | Updated on 2025-04-29 | Category: Papers

Deconstructing Denoising Diffusion Models for Self-Supervised Learning^[1]

The authors are Xinlei Chen, Zhuang Liu, Saining Xie, and Kaiming He, affiliated with FAIR and NYU. Citation: Chen, Xinlei et al. “Deconstructing Denoising Diffusion Models for Self-Supervised Learning.” ArXiv abs/2401.14404 (2024): n. pag.

Key Words

Denoising Diffusion Models
Denoising Autoencoder
low-dimensional latent space

Summary

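Denoising is at the core of current generative models such as Denoising Diffusion Models (DDMs). These models generate very well, and they also appear to learn representations of visual content. Two issues remain:

- Existing studies of the representation ability of DDMs use off-the-shelf pre-trained DDMs, which were originally built for generation and are now being evaluated on representations for recognition;
- It is unclear whether the representation ability comes from the denoising-driven process or from the diffusion-driven process.

The paper's approach is to deconstruct a DDM, changing it step by step into a classical Denoising Autoencoder (DAE), and to examine each of its components along the way. The main finding is that one critical component is the tokenizer, which creates a low-dimensional latent space. The role of using multiple levels of noise is analogous to a form of data augmentation.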

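The architecture is shown in Figure 1. It is called "l-DAE", for latent Denoising Autoencoder. With a single noise level (that is, without the noise scheduling of DDM), it still achieves a decent result. The authors argue that the representation ability of DDMs is gained through the denoising-driven process rather than the diffusion-driven process.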
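
To make "single noise level" concrete, the forward noising step of a DDM can be written as below. This is the standard variance-preserving form; the symbols (z_0 for the clean latent, gamma_t and sigma_t for the signal and noise scales, epsilon for Gaussian noise) are notation assumed here, not taken from this summary.

```latex
z_t = \gamma_t \, z_0 + \sigma_t \, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I),
\qquad \gamma_t^2 + \sigma_t^2 = 1
```

A noise schedule varies the pair (gamma_t, sigma_t) over the time step t, whereas a single-noise-level setup keeps one fixed pair for every training sample.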
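
The deconstruction steps:

- Reorient the generation-focused settings in DiT toward self-supervised learning. Here the authors remove several of the loss terms used in DDM, remove class-conditioning, and replace the DDM noise schedule with a linear one. The conclusion: generation quality is not correlated with self-supervised learning ability.
- Deconstruct and simplify the tokenizer step by step. Several tokenizers are tried: Conv VAE, patch-wise VAE, patch-wise AE, and patch-wise PCA.
- Push the model toward a classical DAE.

The experiments use no data augmentation, only center crops, which suggests that the representation ability of l-DAE does not depend on data augmentation.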
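
The linear noise schedule from the first step, and the contrast between multi-level and single-level noise, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: "linear" is taken to mean that gamma_t^2 decays linearly to 0, the function names `linear_gamma_sq` and `add_noise` are hypothetical, and the step count and the fixed level 0.5 are arbitrary illustrative values.

```python
import numpy as np


def linear_gamma_sq(num_steps: int) -> np.ndarray:
    """Assumed schedule: gamma_t^2 for t = 1..num_steps, decaying linearly to 0."""
    t = np.arange(1, num_steps + 1)
    return 1.0 - t / num_steps


def add_noise(z0: np.ndarray, gamma_sq, rng: np.random.Generator):
    """Return z_t = gamma * z0 + sigma * eps with sigma^2 = 1 - gamma^2.

    gamma_sq is either a scalar (one level for the whole batch) or an
    array with one value per sample in the batch.
    """
    # Reshape so the per-sample scale broadcasts over the feature axes.
    g = np.asarray(gamma_sq, dtype=float).reshape(-1, *([1] * (z0.ndim - 1)))
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(g) * z0 + np.sqrt(1.0 - g) * eps
    return zt, eps


rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 16))           # toy batch of 4 latent vectors
schedule = linear_gamma_sq(1000)

# Multi-level noise: each sample draws its own time step from the schedule.
t = rng.integers(0, len(schedule), size=z0.shape[0])
zt_multi, _ = add_noise(z0, schedule[t], rng)

# Single-level noise: every sample uses the same fixed level.
zt_single, _ = add_noise(z0, 0.5, rng)
```

Sampling a different `t` per example is what makes multi-level noise behave like a form of augmentation: the same latent is seen under many corruption strengths over the course of training.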

Figure 1: The latent Denoising Autoencoder (l-DAE) architecture we have ultimately reached, after a thorough exploration of deconstructing Denoising Diffusion Models (DDM) [23], with the goal of approaching the classical Denoising Autoencoder (DAE) [39] as much as possible. Here, the clean image (left) is projected onto a latent space using patch-wise PCA, in which noise is added (middle). It is then projected back to pixels via inverse PCA. An autoencoder is learned to predict a denoised image (right). This simple architecture largely resembles classical DAE (with the main difference that noise is added to the latent) and achieves competitive self-supervised learning performance.
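
The input corruption described in the Figure 1 caption (patch-wise PCA, noise in the latent, inverse PCA back to pixels) can be sketched in NumPy as below. This is a rough illustration, not the authors' implementation: all function names are hypothetical, the PCA basis is fit on the patches of a single random image rather than on a dataset, the noise is purely additive with an arbitrary `sigma`, and the patch size and latent dimension are illustrative values. The autoencoder that learns to predict the clean image is omitted.

```python
import numpy as np


def patchify(img: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches of shape (N, p*p*C)."""
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * C)


def unpatchify(patches: np.ndarray, p: int, shape) -> np.ndarray:
    """Inverse of patchify: (N, p*p*C) patches back to an (H, W, C) image."""
    H, W, C = shape
    x = patches.reshape(H // p, W // p, p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)


def fit_pca(patches: np.ndarray, d: int):
    """Return the patch mean and the top-d principal directions as a (D, d) basis."""
    mean = patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:d].T


def noisy_input(img, p, mean, basis, sigma, rng):
    """Project patches to the PCA latent, add Gaussian noise, project back to pixels."""
    z = (patchify(img, p) - mean) @ basis                # clean latent
    z_noisy = z + sigma * rng.standard_normal(z.shape)   # noise added in the latent
    x_noisy = z_noisy @ basis.T + mean                   # inverse PCA
    return unpatchify(x_noisy, p, img.shape)


rng = np.random.default_rng(0)
img = rng.random((16, 16, 3))                 # toy "image"
mean, basis = fit_pca(patchify(img, 4), d=8)  # 48-dim patches -> 8-dim latent
out = noisy_input(img, 4, mean, basis, sigma=0.5, rng=rng)
```

The array `out` has the same shape as `img` and would serve as the autoencoder's input, with the clean image as the prediction target.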

Related Work

Figure 2: A classical DAE and a modern DDM. (a) A classical DAE that adds and predicts noise on the image space. (b) State-of-the-art DDMs (e.g., LDM [33], DiT [32]) that operate on a latent space, where the noise is added and predicted.
