Unconditional Generation

Return of Unconditional Generation: A Self-supervised Representation Generation Method[1]

The authors are Tianhong Li, Dina Katabi, and Kaiming He from MIT CSAIL. Citation: Li, Tianhong et al. "Return of Unconditional Generation: A Self-supervised Representation Generation Method." (2023).

The goal is to understand the essential semantic content of object images, rather than stopping at surface patterns and features, and thereby improve generalization: understanding and learning feature representations from small to large, from fine-grained to broad, from local to global.

Key Words

  • unconditional generation with unlabeled data.
  • self-supervised encoder: MoCo v3 ViT-B
  • Representation Generation: RDM 12-block, 1536-hid-dim for 100 epochs
  • Image generation: MAGE-B for 200 epochs
  • Representation-Conditioned Generation (RCG)
  • generate semantic representations in the representation space

### Summary
  1. Generative models have long been developed as unsupervised methods, with landmark works such as GANs, VAEs, and diffusion models. These foundational methods focus on the probability distribution of the data and do not depend on the availability of human annotations. The problem is often categorized as unconditional generation, which pursues learning complex distributions from large amounts of unlabeled data. Narrowing the gap between conditional and unconditional generation is a valuable problem, and a necessary step toward unlocking the power of large-scale unlabeled data.

  2. Representations produced by self-supervised encoders capture many semantic attributes and can serve as a condition for generating images. This yields a new, self-supervised approach to unconditional generation that needs no human annotation. RCG achieves an unprecedented 2.15 FID for unconditional generation on the ImageNet \(256 \times 256\) benchmark. The performance is strong; the experimental results are examined later.

  3. Generative models primarily aim to model the data distribution accurately, so as to generate new data points that resemble the original data.

    • One class is GAN-based.
    • Another class is two-stage: first tokenize the image into a latent space, then perform maximum-likelihood estimation and sampling in the token space; a minimal sketch of this recipe follows below. Diffusion models have also achieved strong results in image synthesis.
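
Below is a minimal, hedged sketch of the two-stage recipe just named: a fixed random codebook stands in for a learned VQ tokenizer, and a tiny GRU stands in for the large autoregressive Transformer used in practice. All names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stage 1: "tokenize" patch features by nearest-neighbor lookup in a codebook.
codebook = torch.randn(512, 16)               # 512 hypothetical codebook vectors

def tokenize(patches):                        # patches: (N, 16) patch features
    return torch.cdist(patches, codebook).argmin(dim=-1)  # token ids, shape (N,)

# Stage 2: maximum likelihood over token sequences with a tiny AR model.
emb = nn.Embedding(512, 64)
gru = nn.GRU(64, 64, batch_first=True)
head = nn.Linear(64, 512)

tokens = tokenize(torch.randn(64, 16)).unsqueeze(0)  # one 64-token "image"
h, _ = gru(emb(tokens[:, :-1]))                      # predict each next token
loss = nn.functional.cross_entropy(head(h).flatten(0, 1), tokens[:, 1:].flatten())
loss.backward()
```
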
  4. Unconditional generation aims to model the data distribution without relying on human annotations; landmark works include GANs, VAEs, and diffusion models. However, on complex data distributions there is a clear gap between conditional and unconditional generative models. Previous work on narrowing this gap mostly groups images into clusters in a representation space and uses the cluster indices as latent class labels to provide conditioning (sketched below). These methods rest on assumptions such as the dataset being clusterable.
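
A minimal sketch of that cluster-conditioning recipe, assuming scikit-learn's KMeans; the feature matrix below is a random stand-in for real encoder features, not the cited works' code.

```python
import numpy as np
from sklearn.cluster import KMeans

feats = np.random.randn(10_000, 256)      # stand-in for self-supervised features
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats)
pseudo_labels = kmeans.labels_            # one cluster index per image

# A conditional generator is then trained with `pseudo_labels` in place of
# human class labels; this presupposes that the dataset is clusterable.
```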

  5. Representations for image generation: DALL-E 2 converts text prompts into image embeddings and then uses those image embeddings as conditions to generate images. DiffAE trains an image encoder end-to-end with a diffusion model as the decoder, aiming to learn a meaningful and decodable image representation. Another line of work, retrieval-augmented generative models, generates images conditioned on representations extracted from existing images; these methods depend heavily on ground-truth images to provide representations during generation.

  6. Directly modeling the complex, high-dimensional image distribution is challenging. The authors' method:

    • first model the distribution of a compact, low-dimensional representation;
    • then model the image distribution conditioned on this representation distribution.
    • RCG has three parts: a pretrained self-supervised image encoder, a representation generator, and an image generator.

    1. An off-the-shelf image encoder maps the image distribution to a representation distribution. The encoder is pretrained with a self-supervised contrastive learning method (MoCo v3). The resulting distribution has two key properties: it is simple enough to be modeled by an unconditional representation generator, and it carries rich high-level semantic content, which is essential for guiding image generation. A minimal sketch follows.
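
A minimal sketch of this stage, assuming a timm ViT-B as a stand-in for the pretrained MoCo v3 encoder (loading the actual MoCo v3 weights is omitted):

```python
import timm
import torch

# Frozen self-supervised encoder; `vit_base_patch16_224` is a stand-in name.
encoder = timm.create_model("vit_base_patch16_224", num_classes=0)  # 768-d output
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False               # the encoder stays fixed in RCG training

images = torch.randn(8, 3, 224, 224)      # dummy batch
with torch.no_grad():
    reps = encoder(images)                # (8, 768): samples of the representation distribution
```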

    2. A diffusion model performs unconditional representation generation; it is called the representation diffusion model (RDM). It uses a fully connected network: the image representation is mixed with Gaussian noise, and the RDM backbone is trained to denoise it. A simplified sketch follows.
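
A hedged sketch of the RDM idea: a plain fully connected network trained to denoise noised representations. The linear noising schedule and the small MLP below are simplifications (the keywords above mention 12 blocks with 1536 hidden dims), not the paper's exact design.

```python
import torch
import torch.nn as nn

rdm = nn.Sequential(                        # fully connected backbone, 1536 hidden dims
    nn.Linear(768 + 1, 1536), nn.SiLU(),
    nn.Linear(1536, 1536), nn.SiLU(),
    nn.Linear(1536, 768),
)

reps = torch.randn(8, 768)                  # clean representations from the encoder
t = torch.rand(8, 1)                        # diffusion time in [0, 1]
noise = torch.randn_like(reps)
noised = (1 - t) * reps + t * noise         # simplified linear noising schedule
pred = rdm(torch.cat([noised, t], dim=-1))  # condition the MLP on t
loss = nn.functional.mse_loss(pred, noise)  # standard noise-prediction loss
loss.backward()
```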

    3. The image generator reconstructs a masked image back to the original image, conditioned on the representation of the same image. The authors use MAGE, a parallel decoding generative model. A sketch of the masked-reconstruction idea follows.
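
A hedged sketch of that idea, assuming a VQ-tokenized image and a tiny Transformer; it illustrates conditioning on the representation, not MAGE's actual architecture.

```python
import torch
import torch.nn as nn

V, D, L = 1024, 256, 64                   # vocab size, width, tokens per image
tok_emb = nn.Embedding(V + 1, D)          # +1 for a [MASK] token (id == V)
rep_proj = nn.Linear(768, D)              # project the 768-d representation
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, 4, batch_first=True), 2)
head = nn.Linear(D, V)

tokens = torch.randint(0, V, (8, L))      # tokenized images (e.g., from a VQ tokenizer)
reps = torch.randn(8, 768)                # representations of the same images
mask = torch.rand(8, L) < 0.75            # mask most token positions at random

inp = tokens.masked_fill(mask, V)         # replace masked positions with [MASK]
x = tok_emb(inp) + rep_proj(reps).unsqueeze(1)  # add the representation as conditioning
logits = head(decoder(x))                 # predict token ids at every position
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # masked tokens only
loss.backward()
```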

    4. Evaluation is on ImageNet without using ImageNet labels. Conclusion: the method generates images without human annotation and achieves strong results. A schematic of the full sampling pipeline follows.
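
Finally, a schematic sketch of how the pieces compose at sampling time. `rdm_denoise` and `image_generator` are stubs standing in for the trained models above, and the denoising update is a placeholder, not a real diffusion sampler.

```python
import torch

def rdm_denoise(x, t):                    # stub: a trained RDM would go here
    return torch.zeros_like(x)

def image_generator(reps):                # stub: a trained MAGE would go here
    return torch.rand(reps.size(0), 3, 256, 256)

def generate(n, steps=50):
    x = torch.randn(n, 768)               # start from pure noise
    for i in range(steps):                # iteratively denoise into a representation
        t = torch.full((n, 1), 1 - i / steps)
        x = x - rdm_denoise(x, t) / steps
    return image_generator(x)             # decode images conditioned on x

images = generate(4)                      # (4, 3, 256, 256) unconditional samples
```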

RCG framework \(Fig. 1^{[1]}\): The Representation-Conditioned Generation (RCG) framework for unconditional generation. RCG consists of three parts: (a) it uses a pre-trained self-supervised encoder to map the image distribution to a representation distribution; (b) it learns a representation generator that samples from a noise distribution and generates a representation subject to the representation distribution; (c) it learns an image generator (e.g., ADM, DiT, or MAGE) that maps a noise distribution to the image distribution conditioned on the representation distribution.

RCG training framework \(Fig. 2^{[1]}\): RCG’s training framework. The pre-trained self-supervised image encoder extracts representations from images and is fixed during training. To train the representation generator, we add standard Gaussian noise to the representations and ask the network to denoise them. To train the MAGE image generator, we add random masking to the tokenized image and ask the network to reconstruct the missing tokens conditioned on the representation extracted from the same image.