1. Introduction

How can a probabilistic model be trained efficiently when its posterior distribution is intractable? The variational Bayesian (VB) approach involves optimizing an approximation to the intractable posterior. In particular, the SGVB (Stochastic Gradient Variational Bayes) estimator enables efficient approximate inference in almost any model with continuous latent variables and/or parameters, and it can be optimized straightforwardly with standard stochastic gradient ascent techniques.

When the datapoints are i.i.d. and each has continuous latent variables, the Auto-Encoding VB (AEVB) algorithm trains the model efficiently without iterative inference such as MCMC for every datapoint. The learned inference model can be used for a range of tasks, including recognition, denoising, representation, and visualization. When a neural network is used as the recognition model, the result is called a variational auto-encoder.


2. Method

This section derives a lower bound estimator, which serves as a stochastic objective function, for directed graphical models with continuous latent variables. We assume an i.i.d. dataset with latent variables per datapoint, perform maximum likelihood (ML) or maximum a posteriori (MAP) inference on the parameters, and perform variational inference on the latent variables.

2.1. Problem scenario

Consider a dataset $\mathbf X = \{\mathbf x^{(i)}\}_{i=1}^N$ consisting of $N$ i.i.d. samples of a variable $\mathbf x$. We assume the data are generated by a random process involving an unobserved continuous random variable $\mathbf z$. The process consists of two steps.

(1) A value $\mathbf z^{(i)}$ is generated from the prior distribution $p_{\boldsymbol \theta^*}(\mathbf z)$

(2) A value $\mathbf x^{(i)}$ is generated from the conditional distribution $p_{\boldsymbol \theta^*}(\mathbf x |\mathbf z)$

Here the prior $p_{\boldsymbol \theta^*}(\mathbf z)$ and the likelihood $p_{\boldsymbol \theta^*}(\mathbf x |\mathbf z)$ are obtained from the parametric families $p_{\boldsymbol \theta}(\mathbf z)$ and $p_{\boldsymbol \theta}(\mathbf x |\mathbf z)$ by setting $\boldsymbol \theta = \boldsymbol \theta^*$, and their PDFs are assumed to be differentiable almost everywhere with respect to both $\boldsymbol \theta$ and $\mathbf z$. Unfortunately, the true parameters $\boldsymbol \theta^*$ and the latent variables $\mathbf z^{(i)}$ are unknown to us.
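The two-step generative process above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not a model taken from the source: the standard normal prior, the Gaussian likelihood, the one-hidden-layer `tanh` decoder standing in for $\boldsymbol \theta^*$, the dimensions, and the function name `sample_dataset` are all hypothetical choices.

```python
import numpy as np

def sample_dataset(n, latent_dim=2, data_dim=5, noise_std=0.1, seed=0):
    """Draw n i.i.d. datapoints via z ~ p(z), then x ~ p(x | z)."""
    rng = np.random.default_rng(seed)
    # Hypothetical "true" parameters theta*: weights of a small decoder network.
    W1 = rng.normal(size=(latent_dim, 16))
    W2 = rng.normal(size=(16, data_dim))
    # Step (1): z^(i) ~ p(z), assumed here to be a standard normal prior.
    Z = rng.standard_normal((n, latent_dim))
    # Step (2): x^(i) ~ p(x | z), assumed Gaussian around the decoder output.
    mean = np.tanh(Z @ W1) @ W2
    X = mean + noise_std * rng.standard_normal((n, data_dim))
    return X, Z

X, Z = sample_dataset(1000)
print(X.shape, Z.shape)
```

In the learning problem only `X` would be observed; `Z`, `W1`, and `W2` play the role of the unknown latent variables and true parameters.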

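Importantly, we do not assume that the marginal probability $p_{\boldsymbol \theta}(\mathbf x)$, which is needed to compute the posterior, is readily available. In other words, we are interested in an algorithm that works well even in the following cases.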

Intractability: the integral of the marginal likelihood $p_{\boldsymbol \theta}(\mathbf x)=\int p_{\boldsymbol \theta}(\mathbf z)p_{\boldsymbol \theta}(\mathbf x|\mathbf z)\,d\mathbf z$ is intractable, and so is the true posterior density $p_{\boldsymbol \theta}(\mathbf z|\mathbf x)= p_{\boldsymbol \theta}(\mathbf x|\mathbf z)p_{\boldsymbol \theta}(\mathbf z)/p_{\boldsymbol \theta}(\mathbf x)$
- This kind of intractability is common in models such as neural networks with a nonlinear hidden layer.
A large dataset: there is too much data
- In this case batch optimization requires a very large amount of computation. Sampling-based solutions such as Monte Carlo EM are too slow, because they run a sampling loop for every datapoint.
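To make the marginal likelihood integral concrete, the sketch below estimates $p_{\boldsymbol \theta}(\mathbf x)$ for a single datapoint by averaging $p_{\boldsymbol \theta}(\mathbf x|\mathbf z)$ over samples from the prior. The toy model is an assumption made purely for illustration: $z \sim \mathcal N(0, 1)$ and $x \mid z \sim \mathcal N(z, 1)$, chosen because its marginal $\mathcal N(0, 2)$ is known in closed form and the estimate can be checked. The helper names `gaussian_pdf` and `naive_marginal_likelihood` are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution N(mean, var) at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def naive_marginal_likelihood(x, num_samples=200_000, seed=0):
    """Monte Carlo estimate of p(x) = E_{p(z)}[p(x | z)] for the toy model."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(num_samples)            # z ~ p(z) = N(0, 1)
    return gaussian_pdf(x, mean=z, var=1.0).mean()  # average of p(x | z)

x = 1.0
estimate = naive_marginal_likelihood(x)
exact = gaussian_pdf(x, mean=0.0, var=2.0)  # closed form exists only in this toy case
print(estimate, exact)
```

With a nonlinear likelihood no closed form is available for comparison, and a sampling loop like this one would have to run for each datapoint, which is the cost the large-dataset case refers to.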