DAE

Extracting and Composing Robust Features with Denoising Autoencoders[1]

This is a 2008 paper from a team at the Université de Montréal; the authors are Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Citation: Vincent, Pascal et al. "Extracting and composing robust features with denoising autoencoders." International Conference on Machine Learning (2008).

Time

  • 2008.Feb

Key Words

Summary

  1. The difficulty of learning deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps inputs to useful intermediate representations. The authors propose a new way of learning representations in an unsupervised fashion, based on making the learned representations robust to partial corruption of the input pattern.

  2. Each layer produces a representation of the input pattern that is more abstract than the previous layer's, because it is obtained by composing more operations.

  3. The question: what criteria does a good intermediate representation need to satisfy?

    • At a minimum it should retain a certain amount of information about its input, while being constrained to a given form (e.g., a real-valued vector of a given size in the case of an autoencoder).
    • A complementary criterion is sparsity of representations.
    • The criterion the authors hypothesize and investigate: robustness to partial destruction of the input, i.e., partially destroyed inputs should yield almost the same representation. This is motivated by the following:
      • A good representation is expected to capture stable structures in the form of dependencies and regularities characteristic of the (unknown) distribution of its observed input. Humans are able to recognize objects from corrupted or occluded images, and further evidence is that we can form high-level concepts associated with multiple modalities and recall them even when some of the modalities are missing. The criterion is inspired by this human ability.
  4. Basic Autoencoder:

    • Input: \(\mathbf{x} \in [0,1]^d\) is mapped to a hidden representation \(\mathbf{y} \in [0,1]^{d'}\) through a deterministic mapping \[\mathbf{y}=f_\theta(\mathbf{x})=s(\mathbf{W}\mathbf{x}+\mathbf{b}),\] with \(\theta=\{\mathbf{W},\mathbf{b}\}\), where \(\mathbf{W}\) is a \(d'\times d\) weight matrix and \(\mathbf{b}\) is a bias vector. The latent representation \(\mathbf{y}\) is then mapped back to a "reconstructed" vector \(\mathbf{z} \in [0,1]^d\): \[\mathbf{z}=g_{\theta^{\prime}}(\mathbf{y})=s(\mathbf{W}^{\prime}\mathbf{y}+\mathbf{b}^{\prime})\mathrm{~with~}\theta^{\prime}=\{\mathbf{W}^{\prime},\mathbf{b}^{\prime}\}.\] Each training example \(\mathbf{x}^{(i)}\) is thus mapped to a \(\mathbf{y}^{(i)}\) and a reconstruction \(\mathbf{z}^{(i)}\), and the parameters are optimized to minimize the average reconstruction error (see the sketch after this item), which is similar in spirit to PCA:

    \[\begin{aligned}\theta^\star,\theta^{\prime\star}&=\arg\min_{\theta,\theta^\prime}\frac1n\sum_{i=1}^nL\left(\mathbf{x}^{(i)},\mathbf{z}^{(i)}\right)\\&=\arg\min_{\theta,\theta^\prime}\frac1n\sum_{i=1}^nL\left(\mathbf{x}^{(i)},g_{\theta^\prime}(f_\theta(\mathbf{x}^{(i)}))\right)\end{aligned}\] where \(L\) is a loss function such as the squared error, or the reconstruction cross-entropy when the inputs are in \([0,1]^d\).
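
A minimal NumPy sketch of the encoder/decoder mapping and reconstruction loss above. The layer sizes and the names `encode`/`decode` are illustrative (not from the paper's code), and the gradient-based minimization of the average loss is omitted:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, d_hidden = 784, 500                                  # illustrative sizes: d inputs, d' hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(d_hidden, d))          # encoder weights, d' x d
b = np.zeros(d_hidden)                                  # encoder bias
W_prime = rng.normal(scale=0.01, size=(d, d_hidden))    # decoder weights (untied here)
b_prime = np.zeros(d)                                   # decoder bias

def encode(x):
    # y = f_theta(x) = s(Wx + b)
    return sigmoid(W @ x + b)

def decode(y):
    # z = g_theta'(y) = s(W'y + b')
    return sigmoid(W_prime @ y + b_prime)

def reconstruction_loss(x, z, eps=1e-12):
    # cross-entropy reconstruction loss for x, z in [0,1]^d
    return -np.sum(x * np.log(z + eps) + (1 - x) * np.log(1 - z + eps))

x = rng.uniform(size=d)         # a toy input in [0,1]^d
z = decode(encode(x))
print(reconstruction_loss(x, z))
```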

  5. Denoising Autoencoder:

    • The initial input \(\mathbf{x}\) is corrupted into a partially destroyed version \(\tilde{\mathbf{x}}\) via the stochastic mapping \(\tilde{\mathbf{x}}\sim q_{\mathcal{D}}(\tilde{\mathbf{x}}|\mathbf{x})\): a fixed proportion of the components of \(\mathbf{x}\) are chosen at random and set to 0, while the rest are left untouched (zeroing components is akin to salt-and-pepper-style noise on images). The rest of the procedure is the same as for the basic autoencoder; a small sketch of the corruption step follows this item.
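
A small sketch of the corruption process \(q_{\mathcal{D}}\) described above, assuming a NumPy setup like the previous sketch; the function name and fraction value are illustrative:

```python
import numpy as np

def corrupt(x, destroy_fraction=0.25, rng=None):
    """q_D(x_tilde | x): set a fixed random fraction of the components of x to 0."""
    rng = rng if rng is not None else np.random.default_rng()
    x_tilde = x.copy()
    n_destroy = int(round(destroy_fraction * x.size))
    idx = rng.choice(x.size, size=n_destroy, replace=False)  # components to destroy
    x_tilde[idx] = 0.0
    return x_tilde

# The denoising autoencoder encodes the corrupted x_tilde but is trained to
# reconstruct the clean x, i.e. it minimizes L(x, g(f(x_tilde))).
```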
  6. The denoising autoencoder is trained to recover the clean input from the corrupted version, i.e., a denoising task. Image denoising has been studied extensively before; for instance, a form of gated autoencoder was used for denoising in Memisevic (2007), and using autoencoders for denoising goes back much further (LeCun, 1987). Here, the authors use explicit robustness to corrupting noise only as a criterion to guide the learning of suitable intermediate representations, with the goal of obtaining a better general-purpose learning algorithm. The corruption + denoising procedure is applied not only to the input but also recursively to the intermediate representations. This resembles training with noise / data augmentation, but it does not rely on prior knowledge about the input domain. The non-linearity of the denoising autoencoder is important for it to be useful as an initialization of a deep neural network. A sketch of the layer-wise procedure follows this item.
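
A high-level PyTorch sketch of how this greedy, layer-wise corruption + denoising could be applied recursively to intermediate representations and then followed by supervised fine-tuning. Layer sizes, the independent per-component masking, and names such as `pretrain_layer` are assumptions for illustration, not the paper's exact setup:

```python
import torch
import torch.nn as nn

sizes = [784, 500, 500, 500]          # illustrative: input size + three hidden layers

def corrupt(x, destroy_fraction=0.25):
    # Zero out components at random; masking each component independently is a
    # simplification (the paper destroys a fixed number of components per example).
    mask = (torch.rand_like(x) > destroy_fraction).float()
    return x * mask

encoders = [nn.Sequential(nn.Linear(sizes[i], sizes[i + 1]), nn.Sigmoid())
            for i in range(len(sizes) - 1)]

def pretrain_layer(k, data_loader, epochs=10, lr=0.1):
    """Greedy layer-wise pretraining of the k-th denoising autoencoder."""
    decoder = nn.Sequential(nn.Linear(sizes[k + 1], sizes[k]), nn.Sigmoid())
    params = list(encoders[k].parameters()) + list(decoder.parameters())
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        for x, _ in data_loader:                       # data_loader yields (inputs, labels)
            with torch.no_grad():                      # clean representation at level k
                h = x.view(x.size(0), -1)
                for j in range(k):
                    h = encoders[j](h)
            recon = decoder(encoders[k](corrupt(h)))   # denoise the corrupted representation
            loss = nn.functional.binary_cross_entropy(recon, h)
            opt.zero_grad(); loss.backward(); opt.step()

# After pretraining every layer, stack the encoders with a classifier on top
# and fine-tune the whole network on the supervised task:
model = nn.Sequential(*encoders, nn.Linear(sizes[-1], 10))
```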

  7. \(Y= f(X)\) can be interpreted as a coordinate system for points on a manifold. More generally, \(Y= f(X)\) can be seen as a representation of \(X\) that captures the main variations in the data, i.e., on the manifold. When additional criteria (such as the denoising one) are introduced into the learning scheme, \(Y= f(X)\) can no longer be viewed directly as an explicit low-dimensional coordinate system for points on the manifold, but it retains the property of capturing the main factors of variation in the data. In this sense the idea is still similar to dimensionality-reduction methods such as PCA: the data are compressed into a lower-dimensional representation.

  8. Experiments: a neural network with 3 hidden layers is initialized by stacking denoising autoencoders and then fine-tuned on a classification task. Using corruption + denoising training as the initialization step works well and yields better classification performance than initializing with basic autoencoders trained with no noise.

  9. Conclusion: the work is motivated by the goal of learning representations of the input that are robust to small, irrelevant changes in the input. Unsupervised layer-wise initialization with an explicit denoising criterion captures interesting structure in the input distribution. This leads to intermediate representations better suited for subsequent learning tasks such as supervised classification.

Illustration \(Fig.2^{[1]}\): Illustration of what the denoising autoencoder is trying to learn. Suppose training data (crosses) concentrate near a low-dimensional manifold. A corrupted example (circle) is obtained by applying the corruption process \(q_{\mathcal{D}}(\tilde{X}|X)\) (left side). Corrupted examples are typically outside and farther from the manifold, hence the model learns with \(p(X|\tilde{X})\) to map points back to more likely points (right side). Mapping from more corrupted examples requires bigger jumps (longer dashed arrows).