Discriminative and generative models are two main approaches to the classification problem in machine learning. While both aim at finding \(p(y \mid \mathbf x)\), where \(\mathbf x\) and \(y\) respectively denote the input (data) and the label, discriminative models directly estimate \(p(y \mid \mathbf x)\) and generative models first model \(p(\mathbf x)\) and \(p(\mathbf x, y)\) and use the Bayes’ Rule to find \(p(y \mid \mathbf x)\). Learning \(p(\mathbf x)\) and \(p(\mathbf x, y)\) in the inference procedure enables a wider array of tasks, including generating a new set of data that follows the distribution of the training data. Generative Adversarial Networks (GAN) and diffusion models such as the DALL-E model family, are popular generative models that leverage artificial neural networks.

This post discusses variational autoencoders (VAEs) and their mathematical details. We first take a look at the two components hinted by its name (“variational” and “autoencoder”), their practical implementations using artificial neural networks, as well as the information on the optimization using stochastic gradients.

Variational inference

What is variational inference?

The word “variational” in VAE refers to variational inference, which is an idea from the calculus of variations. Variational inference is a method that approximates probability distributions that are difficult to compute directly through optimization (Blei, Kucukelbir & McAuliffe, 2018), often preferred over a popular and more “classic” alternative, Markov Chain Monte Carlo (MCMC), which can take an exponential time when the dataset is huge or the model is overly complex. We introduce latent variables \(\mathbf z\) that account for the unobserved factors behind the observed \(\mathbf x\). This allows us to circumvent a direct modeling of \(p(\mathbf x)\): \(p(\mathbf x) = \displaystyle\int p(\mathbf x, \mathbf z) d\mathbf z = \displaystyle\int p(\mathbf x \mid \mathbf z) p(\mathbf z) d\mathbf z \), which defines a generative process by sampling \(\mathbf z \sim p(\mathbf z)\), then \(\mathbf x \sim p(\mathbf x \mid \mathbf z)\).

The evidence is intractible

One might question why we are approximating \(p(\mathbf z \mid \mathbf x)\) in the first place. The answer is in the intractibility of the marginal, also called as the evidence. Specifically, we have

$$ p(\mathbf z \mid \mathbf x) = \frac{p(\mathbf x, \mathbf z)}{p(\mathbf x)} = \displaystyle\frac{p(\mathbf x, \mathbf z)}{\displaystyle\int p(\mathbf x, \mathbf z) d\mathbf z} $$

and the evidence in the denominator involves an expression that is intractible in real life applications or takes a very long time to compute.

So we approximate instead

Instead of directly computing \(p(\mathbf z \mid \mathbf x)\), the central theme of variational inference is in approximating the posterior \(p(\mathbf z \mid \mathbf x)\) with a distribution that belongs to a family of simpler distributions over \(\mathbf z\), e.g. family of Gaussian distributions, and is close to the posterior itself. The distance between distributions is measured by the reverse Kullback-Leibler divergence. If we initially posit a family \(\mathcal Q\) of distributions, the distribution of interest can be mathematically expressed as:

$$ q^*(\mathbf z) = {\arg\min}_{q(\mathbf z) \in \mathcal Q} \, D_\text{KL} \left( q(\mathbf z) \, \lVert \, p(\mathbf z \mid \mathbf x) \right). $$

You might have noticed that we’re utilizing the reverse KL divergence as opposed to the forward KL divergence. In general, we have \(D_{\text{KL}} (P \, \lVert \, Q) \neq D_{\text{KL}} (Q \, \lVert \, P)\) for two probability distributions \(P\) and \(Q\). The benefit of using the reverse KL divergence in our current discussion and in the expectation-maximization (EM) algorithm is due to the easier computation. Notice that we have the flexibility of designing \(q(\mathbf z)\), so the reverse KL divergence makes computing expectations much simpler.

Let’s expand the expression using the definition of KL divergence.

$$ \begin{align*} D_\text{KL} \left( q(\mathbf z) \, \lVert \, p(\mathbf z \mid \mathbf x) \right) &= \int q(\mathbf z) \log \frac{q(\mathbf z)}{p(\mathbf z \mid \mathbf x)} d\mathbf z \\ &= \mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \log \frac{q(\mathbf z)}{p(\mathbf z \mid \mathbf x)} \right] \\ &= \mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \log q(\mathbf z) \right] - \mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \log p(\mathbf z \mid \mathbf x) \right] \\ &= \mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \log q(\mathbf z) \right] - \mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \log p(\mathbf x, \mathbf z) \right] + \log p(\mathbf x), \end{align*} $$

where we used \( \mathbb E_{{\mathbf z \sim q(\mathbf z)}} \left[ \log p(\mathbf x) \right] = \log p(\mathbf x) \) in the last equation, for the expression does not depend on \(\mathbf z\). At this point, we can notice two things:

  1. The KL divergence is also not directly computable this way because of the intractibility of \(p(\mathbf x)\);
  2. \(\log p(\mathbf x)\) is a fixed quantity respect to \(q(\mathbf z)\).

Reorganizing the above equation, we obtain the following equation, where the right-hand side can be decomposed into two parts:

$$ \log p(\mathbf x) = \underbrace{D_{\text{KL}} \left( q(\mathbf z) \, \lVert \, p(\mathbf z \mid \mathbf x) \right)}_{[\mathbf 1]} + \underbrace{\mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \log p(\mathbf x, \mathbf z) \right] - \mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \log q(\mathbf z) \right]}_{[\mathbf 2]}. $$

Since \(\log p(\mathbf x)\) is a fixed quantity, a higher value of \([\mathbf 1]\) equates to a lower value of \([\mathbf 2]\), and vice versa. Hence, as our initial objective was minimizing \([\mathbf 1]\), our alternative objective can be set to maximizing \([\mathbf 2]\), which we call the Evidence Lower Bound.

Evidence Lower Bound

The Evidence Lower Bound (ELBO), or the variational lower bound, is:

$$ \begin{align*} \mathcal L(\mathbf x) &= \mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \log p(\mathbf x, \mathbf z) \right] - \mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \log q(\mathbf z) \right] \\ &= \mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \log \frac{p(\mathbf x, \mathbf z)}{q(\mathbf z)} \right]. \end{align*} $$

Noticing that the KL divergence is always non-negative, we find that

$$ \begin{align*} \mathcal L(\mathbf x) &= \log p(\mathbf x) - \underbrace{D_{\text{KL}} \left( q(\mathbf z) \, \lVert \, p(\mathbf z \mid \mathbf x) \right)}_{\geq 0} \leq \log p(\mathbf x). \end{align*} $$

It is also possible to show this inequality using Jensen’s inequality:

$$ \begin{align*} \log p(\mathbf x) &= \log \int p(\mathbf x, \mathbf z) d\mathbf z \\ &= \log \int q(\mathbf z) \frac{p(\mathbf x, \mathbf z)}{q(\mathbf z)} d\mathbf z \\ &= \log \mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \frac{p(\mathbf x, \mathbf z)}{q(\mathbf z)} \right] \\ &\geq \mathbb E_{\mathbf z \sim q(\mathbf z)} \left[ \log \frac{p(\mathbf x, \mathbf z)}{q(\mathbf z)} \right] \\ &= \mathcal L(\mathbf x). \end{align*} $$

The ELBO, as its name signifies, is the lower bound of the log-evidence. Our objective can thus be set to maximizing the ELBO, so that the KL divergence is minimized.

Let’s now briefly discuss the second piece of this model before we transition to a mathematically more nuanced section on how to optimize the model parameters.

Autoencoder

VAEs are a type of autoencoders, which comprises of an encoder and a decoder. An encoder transforms the input data to an efficient representation of the data (often called the “code”), and a decoder reconstructs the data given the code. A variational autoencoder combines the standard autoencoder setup and the variational Bayesian methods we briefly touched on in the previous section. An encoder takes the original data \(\mathbf x\) and transforms it into its latent representations \(\mathbf z\). The decoder takes \(\mathbf z\) and generates \(\tilde{\mathbf{x}}\) trying to reconstruct it to be as similar as \(\mathbf x\). This overall setup resembles diffusion models.

Basics of VAE

In training a VAE, what the encoder does is equivalent to generating \(\mathbf z\) from \(\mathbf x\) or, mathematically, learning the density of \(p(\mathbf z \mid \mathbf x)\). Since this distribution is intractible, we instead proceeded to finding \(q^{\ast}(\mathbf z)\) whose KL divergence with \(p(\mathbf z \mid \mathbf x)\) was minimized. We can make this process more explicit by means of subscripts. Our encoder doesn’t directly work with the parameters of the distribution from which the original data are sampled from. Instead, we work with variational parameters \(\mathbf \phi\) and an inference model \(q_{\mathbf \phi} (\mathbf z \mid \mathbf x)\) parameterized over \(\mathbf \phi\). In practice, \(q_{\mathbf \phi}(\mathbf z \mid \mathbf x)\) is parameterized using deep neural networks. Moreover, we utilize a single encoder in VAEs, i.e. shared variational parameters \(\mathbf \phi\). This approach is called the amortized variational inference (Gershman & Goodman, 2014).

The decoder doesn’t share \(\mathbf \phi\) with the encoder, and the objective of the decoder is learning a distribution that achieves a good reconstruction of \(\mathbf x\) from \(\mathbf z\). The decoder is often represented as \(p_{{\mathbf \theta}} (\mathbf x \mid \mathbf z)\). Combining both pieces together, a VAE can be visually depicted as follows: [image].

VAE is a type of autoencoder used as generative models, and there are numerous variants devised for different tasks. For instance, the denoising autoencoder (DAE) receives a corrupted input and is trained to predict an uncorrupted input, which makes it particularly beneficial for image denoising tasks and anomaly detection. For a list of autoencoder variants, check out Chapter 14 of the Deep Learning book and the section on Wikipedia.

Updating model parameters: jointly optimizing \(\mathbf \theta\) and \(\mathbf \phi\)

So far, we’ve established \( \log p_{\mathbf \theta} (\mathbf x) \geq \mathcal L_{\mathbf \theta, \mathbf \phi} (\mathbf x)\), where the ELBO \(\mathcal L\) is the quantity we aim to maximize:

$$ \mathcal L_{\mathbf \theta, \mathbf \phi} (\mathbf x) = \mathbb E_{\mathbf z \sim q_{\mathbf \phi}(\mathbf z \mid \mathbf x)} \left[ \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } \right]. $$

In order to train a VAE so that it generates a decent output, we need to update the model parameters \(\mathbf \theta\) and \(\mathbf \phi\). To do so, we utilize the standard method of gradient computations followed by parameter updates. While our ELBO is defined for a single input \(\mathbf x\), but we can also extend it to define the ELBO for the entire dataset \(\mathcal D\) (or a minibatch):

$$ \mathcal L_{\mathbf \theta, \mathbf \phi} (\mathcal D) = \sum_{\mathbf x \in \mathcal D} \mathcal L_{\mathbf \theta, \mathbf \phi} (\mathbf x). $$

Why the straightforward approach is insufficient

We need to obtain unbiased gradients of the ELBO with respect to \(\mathbf \theta\) and \(\mathbf \phi\). The gradient of \(\mathcal L_{{\mathbf \theta, \mathbf \phi}} (\mathbf x)\) with respect to \(\mathbf \theta\) is straightforward:

$$ \begin{align*} \nabla_{\mathbf \theta} \mathcal L_{\mathbf \theta, \mathbf \phi}(\mathbf x) &= \nabla_{\mathbf \theta} \, \mathbb E_{\mathbf z \sim q_{\mathbf \phi}(\mathbf z \mid \mathbf x)} \left[ \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } \right] \\ &= \nabla_{\mathbf \theta} \int q_{\mathbf \phi} (\mathbf z \mid \mathbf x) \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } d\mathbf z \\ &= \int \nabla_{\mathbf \theta} \left( q_{\mathbf \phi} (\mathbf z \mid \mathbf x) \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } \right) d\mathbf z \\ &= \int q_{\mathbf \phi} (\mathbf z \mid \mathbf x) \, \nabla_{\mathbf \theta} \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } d\mathbf z \\ &= \mathbb E_{\mathbf z \sim q_{\mathbf \phi} (\mathbf z \mid \mathbf x)} \left[ \nabla_{\mathbf \theta} \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } \right] \\ &\approx \nabla_{\mathbf \theta} \left( \log p_{\mathbf \theta} (\mathbf x, \mathbf z) - \log q_{\mathbf \phi} (\mathbf z \mid \mathbf x) \right) \\ &= \nabla_{\mathbf \theta} \log p_{\mathbf \theta} (\mathbf x, \mathbf z). \end{align*} $$

However, the same process can’t apply for the gradient with respect to \(\mathbf \phi\). The math makes it clear:

$$ \begin{align*} \nabla_{\mathbf \phi} \mathcal L_{\mathbf \theta, \mathbf \phi} (\mathbf x) &= \nabla_{\mathbf \phi} \, \mathbb E_{\mathbf z \sim q_{\mathbf \phi}(\mathbf z \mid \mathbf x)} \left[ \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } \right] \\ &= \nabla_{\mathbf \phi} \int q_{\mathbf \phi} (\mathbf z \mid \mathbf x) \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } d\mathbf z \\ &= \int \nabla_{\mathbf \phi} \left( q_{\mathbf \phi} (\mathbf z \mid \mathbf x) \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } \right) d\mathbf z \\ &= \int \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } \nabla_{\mathbf \phi} \, q_{\mathbf \phi} (\mathbf z \mid \mathbf x) \, d\mathbf z + \int q_{\mathbf \phi} (\mathbf z \mid \mathbf x) \nabla_{\mathbf \phi} \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } d\mathbf z \\ &= \int \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } \nabla_{\mathbf \phi} \, q_{\mathbf \phi} (\mathbf z \mid \mathbf x) \, d\mathbf z + \mathbb E_{\mathbf z \sim q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } \left[ \nabla_{\mathbf \phi} \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } \right] \\ &= \color{red}{ \int \log \frac{ p_{\mathbf \theta} (\mathbf x, \mathbf z) }{ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) } \nabla_{\mathbf \phi} \, q_{\mathbf \phi} (\mathbf z \mid \mathbf x) \, d\mathbf z }\color{black} + \nabla_{\mathbf \phi} \log p_{\mathbf \theta} (\mathbf x, \mathbf z). \end{align*} $$

We can see that the first term in red, which didn’t exist in the previous derivation, is now present due to the total derivative. The distribution \(q_{{\mathbf \phi}} (\mathbf z \mid \mathbf x)\) we are sampling \(\mathbf z\) from depends on \(\mathbf \phi\), hence the expectation itself depends on \(\mathbf \phi\). The stochasticity in the sampling process thus presents a challenge in computing the derivatives to update the model parameters.

The reparameterization trick

The problem with the previous approach originates from the fact that \(\mathbf z \sim q_{{\mathbf \phi}}(\mathbf z \mid \mathbf x)\), making the sample completely stochastic. The reparameterization trick circumvents this problem by fixing a deterministic part in the sampling process and limiting the stochasticity to a random variable \(\mathbf \epsilon\) through a change of variables. To do so, we let \(\mathbf z = g(\mathbf x, \mathbf \epsilon)\), where \(g\) is a differentiable and invertible function and \(\mathbf \epsilon\) is sampled from a noise distribution \(h(\mathbf \epsilon)\). This trick makes the stochasticity to be contained only within the random variable, with everything else deterministic. This allows for the backpropagation in artificial neural networks (i.e. the gradient with respect to the model parameters). The diagram below illustrates this point decently:

Instead of working with \(\mathcal L_{\mathbf \theta, \mathbf \phi}(\mathbf x) = \mathbb E_{\mathbf z \sim q\_{{\mathbf \phi}}(\mathbf z \mid \mathbf x)} [\cdot]\), through a change of variables, we convert it to an equivalent problem:

$$ \mathbb E_{\mathbf z \sim q_{\mathbf \phi}(\mathbf z \mid \mathbf x)} [ \log p_{\mathbf \theta}(\mathbf x, \mathbf z) - \log q_{\mathbf \phi} (\mathbf z \mid \mathbf x) ] = \mathbb E_{\mathbf \epsilon \sim h(\mathbf \epsilon)} [\log p_{\mathbf \theta}(\mathbf x, \mathbf z) - \log q_{\mathbf \phi} (\mathbf z \mid \mathbf x)]. $$

The calculation of the ELBO can still be done via Monte Carlo estimates, where we sample only the random variable \(\mathbf \epsilon_i \sim h(\mathbf \epsilon)\) instead of the sample itself:

$$ \mathbb E_{\mathbf \epsilon \sim h(\mathbf \epsilon)} [f(g(\mathbf x, \mathbf \epsilon))] \approx \frac{1}{L} \sum_{i=1}^L f(g(\mathbf x, \epsilon_i)). $$

Let’s take a simple but concrete example using factorized Gaussian posteriors (Kingma & Welling, 2013). Assume \(q_{\mathbf \phi}(\mathbf z \mid \mathbf x)\) is a multivariate Gaussian distribution with diagonal covariance \(\mathcal N(\mathbf z; \mathbf \mu_{\mathbf \phi} (\mathbf x), \text{diag}(\mathbf \sigma_{{\mathbf \phi}}^2)(\mathbf x))\) and is factorizable as a product of univariate Gaussian distributions:

$$ q_{\mathbf \phi} (\mathbf z \mid \mathbf x) = \prod_i q_{\mathbf \phi} (z_i \mid \mathbf x) = \prod_i \mathcal N(z_i; \mu_i, \sigma_i^2). $$

Instead of sampling \(\mathbf z\) directly, we sample \(\mathbf \epsilon \sim \mathcal N(0, \mathbf I)\). Using matrix notations, we let \(\mathbf z = \mathbf \mu_{\mathbf \phi}(\mathbf x) + \mathbf \sigma_{\mathbf \phi}(\mathbf x) \odot \mathbf \epsilon \), where \(\odot\) is element-wise product. \(\mu_{\mathbf \phi}\) and \(\mathbf \sigma_{\mathbf \phi}\) are parameters \(\mathbf \phi = (\mathbf \mu_{\mathbf \phi}, \mathbf \sigma_{\mathbf \phi})\) learned by the encoder, which is often an artificial neural network. In this example, \(\log q_{{\mathbf \phi}}(\mathbf z \mid \mathbf x)\) has a closed-form expression. Combining everything together, we obtain

$$ \begin{align*} \nabla_{\mathbf \phi} \mathcal L_{\mathbf \theta, \mathbf \phi}(\mathbf x) &= \nabla_{\mathbf \phi} \mathbb E_{\mathbf z \sim q_{\mathbf \phi}(\mathbf z \mid \mathbf x)} \left[ \log \frac{p_{\mathbf \theta}(\mathbf x, \mathbf z)}{q_{\mathbf \phi}(\mathbf z \mid \mathbf x)} \right] \\ &= \nabla_{\mathbf \phi} \mathbb E_{\mathbf \epsilon \sim \mathcal N(0, \mathbf I)} \left[ \log \frac{p_{\mathbf \theta}(\mathbf x, \mathbf z)}{q_{\mathbf \phi}(\mathbf z \mid \mathbf x)} \right] \\ &= \nabla_{\mathbf \phi} \mathbb E_{\mathbf \epsilon \sim \mathcal N(0, \mathbf I)} \left[ \log p_{\mathbf \theta} (\mathbf x \mid \mathbf z) + \log p_{\mathbf \theta} (\mathbf z) - \log q_{\mathbf \phi}(\mathbf z \mid \mathbf x) \right] \\ &= \mathbb E_{\mathbf \epsilon \sim \mathcal N(0, \mathbf I)} \left[ \nabla_{\mathbf \phi} \log p_{\mathbf \theta} (\mathbf x \mid \mathbf z) + \nabla_{\mathbf \phi} \log p_{\mathbf \theta} (\mathbf z) - \nabla_{\mathbf \phi} \log q_{\mathbf \phi}(\mathbf z \mid \mathbf x) \right] \\ &\approx \frac{1}{L} \sum_{i=1}^{L} \left( \nabla_{\mathbf \phi} \log p_{\mathbf \theta} (\mathbf x \mid \mathbf z^{(i)}) + \nabla_{\mathbf \phi} \log p_{\mathbf \theta} (\mathbf z^{(i)}) - \nabla_{\mathbf \phi} \log q_{\mathbf \phi}(\mathbf z^{(i)} \mid \mathbf x) \right). \end{align*} $$

The key takeaway is that we can now perform backpropagation, enabling an end-to-end training.

REINFORCE

There is an alternative approach to the reparameterization trick. This approach, used in the policy gradient algorithm in RL on optimizing the expected reward, is by using the REINFORCE estimator. The core component of this approach is to use a log-derivative trick,

$$ \nabla_{\omega} \, p_{\omega} (x) = p_{\omega}(x) \nabla_{\omega} \log p_{\omega} (x), $$

which can be easily verified by simple rule of calculus, \(\nabla \log f(x) = (f(x))^{-1} \nabla f(x)\). If we denote \(f(\mathbf z)\) as our “score function” with no dependence on \( \mathbf \phi \) for which we wish to evaluate the gradient of its expectation, we have

$$ \begin{align*} \nabla_{\mathbf \phi} \mathbb E_{\mathbf z \sim q_{\mathbf \phi}(\mathbf z \mid \mathbf x)} [f(\mathbf z)] &= \nabla_{\mathbf \phi} \int q_{\mathcal \phi} (\mathbf z \mid \mathbf x) f(\mathbf z) d\mathbf z \\ &= \int \nabla_{\mathbf \phi} \, q_{\mathbf \phi} (\mathbf z \mid \mathbf x) f(\mathbf z) d\mathbf z \\ &= \int q_{\mathbf \phi} (\mathbf z \mid \mathbf x) \nabla_{\mathbf \phi} \log q_{\mathbf \phi} (\mathbf z \mid \mathbf x) f(\mathbf z) d\mathbf z \\ &= \mathbb E_{\mathbf z \sim q_{\mathbf \phi}(\mathbf z \mid \mathbf x)} \left[ \nabla_{\mathbf \phi} \log q_{\mathbf \phi} (\mathbf z \mid \mathbf x) f(\mathbf z) \right] \\ &\approx \frac{1}{L} \sum_{i=1}^L \nabla_{\mathbf \phi} \log q_{\mathbf \phi} (\mathbf z^{(i)} \mid \mathbf x) f(\mathbf z^{(i)}) \end{align*} $$

where \(\mathbf z^{(i)} \sim q_{\mathbf \phi}(\mathbf z \mid \mathbf x)\) from \(i = 1, \ldots, L\). In the second equation, we used the Leibniz Integral Rule and the Monte Carlo estimates in the final equation. To summarize what we’ve done here, instead of estimating the gradient itself, we’ve used the REINFORCE estimator on the score function \(f(\mathbf z)\). The REINFORCE estimator using the log-derivative trick comes with both advantages and disadvantages. While it works for any distribution — even the discrete ones! — and nicely avoids the reparameterization (hence dispensing with the need to formulate a random variable sampling), it has a high variance, making the gradient estimates potentially noisy. In practice, it needs to couple with variance reduction techniques such as NVIL, VIMCO, etc.