One can always try to use the relationships between pdfs for generating samples but often the distributions are not in a closed form. One way to use the theory of markov-chains to create samplers is to use the idea of "detailed balance". Detailed balance says that if we have a process where the forward movement along a path is equal to the backward movement along a path then the system is in a "dynamic equilibrium". Mathematically, DB holds for a markov chain if \( P_{ij} \pi_i = P_{ji} \pi_j \) where P_{ij} is the transition probability from state-i to state-j. We can prove that \( \pi \) is the stationary distribution because \[ \mathbf{P}\pi = \text{up}\Big(\sum P_{ij} \pi_i = \sum P_{ji} \pi_j = \pi_j \mid \forall j\Big) = \pi \]
Metropolis Hastings - MH is one way of creating a transition kernel that is in detailed balance with the required unnormalized distribution \( \tilde{\pi} \) and it adds the flexibility of chosing a conditional proposal distribution for proposing \( x \) given \( x' \) = \( Q(x | x') \) and then an acceptance process. Basically we want the proposal times the acceptance to be in detailed balance with the required stationary distribution, therefore the acceptance probability for jumping from state \( x' \text{ to } x \) has to be \( \min\Bigg(1, \frac{Q(x' \mid x) f(x)}{Q(x \mid x') f(x')}\Bigg) \)
Gibbs Sampling - The MH algorithm is one way of creating sampling algorithms based on detailed balance, but there is a lot of freedom in coming up with proposal distributions. For example, if we use the proposal distribution to be like the one in Gibbs Sampler then the acceptance ratio is guaranteed to be one. I.e. if \( x \) is actually multi-dimensional and we want to sample from the joint distribution of the dimensions/variables then we can just sample individual conditional distributions and we can directly prove that this chain works by proving the detailed balance conditions hold.
Samplers inspired from Hamiltonian Dynamics
Remember that all we need to do for MCMC is to come up with a chain that satisfies detailed balance. MH is one general method, and Gibbs Sampling is a nice-special case, but if we want something between general MH with its "naive" proposal distribution and Gibbs which may not always be possible. Samplers inspired from Hamiltonian-dynamics are a great technique for decreasing the sample complexity. The basic idea underlying hamiltonian-dynamics is to create a proposal distribution which has a really good acceptance probability and allows for quick exploration of the search space. Following is the pseudo-code in python
from random import randn, random
import numpy as np
def randn_like(q):
return randn(len(q))
def HMC(U, G, eps, q, M, L = 1):
"""
U := neg_log_pdf
G := gradient of neg log pdf
eps := The step-size in the leap-frog numerical integrator.
L := the number of steps to take in the HMC sampler.
M := The per-coordinate scaling values by which we will dampen the step-size.
q := the current position.
"""
assert q.shape == M.shape
assert q.ndim == 1
# sample p. This is the only step that changes the H value
# Note that 1) the division is element-wise, 2) we are taking a half-step
p = randn_like(q) / np.sqrt(M)
init_q = q.copy()
init_H = U(q) + (p*p).sum()/2
p = p - eps * G(q) / 2
for i in range(L):
q = q + eps * p / M # div is element-wise
if i < L-1:
p = p - eps * G(q)
else:
p = p - eps * G(q) / 2
p = -p
final_H = U(q) + (p*p).sum()/2
return q if np.log(random()) < init_H - final_H else init_q
|
|
HMC Limitations
- Ill-conditioned distributions that require preconditioning
- Multimodal distributions where it's hard to migrate between modes.
- Discontinuos distributions : Large energy gap from the initial sampling step reduces the acceptance rate.
- Spiky distributions : Large gradients reduces the acceptance rate.
- Large training dataset : Expensive gradient computation
The special-and-general case of Langevin Dynamics
Langevin dynamics can be narrowly understood be a specialization of MH where the algorithm above is run with L=1. But the above algorithm will require the exact gradient, later papers focused on getting rid of this requirement for the exact gradient of the loss.
- Stochastic Gradient Langevin Dynamics (SGLD) by Welling and Teh (ICML 2011)
This paper simply said use the above algorithm with L=1 but with the inexact minibatch gradients, and use a decreasing step-size that's inspired by the SGD algorithm, and add a noise with standard-deviation proportional to the square root of the step-size. The basic idea is that as time-goes on and the step-size becomes small enough, we want the injected noise to dominate the noise due to the minibatch gradient.
- Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring (use Fisher Information) by Sungjin, Korattikara, Welling (ICML 2012).
- Stochastic Gradient Hamiltonian Monte Carlo by Tianqi Chen, Fox, and Guestrin (ICML 2014)
This paper strictly generalizes the HMC algorithm by allowing for L > 1 and also allowing for noisy-minibatch-gradients. First they study the implications of implementing HMC using a stochastic gradient. They model the additional noise from the noisy-minibatch-gradients as follows
\(\displaystyle
\begin{align}
&d\theta = dq = M^{-1} r\ dt\\
&dr = dp = - \nabla U(\theta) dt \color{green}{+ \mathcal{N}(0, \epsilon V(\theta) dt)}
\end{align}\)
They show that the dynamics generated by the additional green component will inject additional variance so that the hamiltonian will not be conserved under these dynamics. To fix this issue they add an additional friction term to the dynamics so that the differential equations better conserve the hamiltonian.
\(\displaystyle
\begin{align}
&d\theta = dq = M^{-1} r\ dt\\
&dr = dp = - \nabla U(\theta) dt \color{green}{+ \mathcal{N}(0, \epsilon V(\theta) dt)} \color{blue}{-(\epsilon V(\theta)/2)M^{-1}rdt}
\end{align}\)
By appealing to standard results about the Fokker-Planck equation which governs the evolution of the probability distribution of a random variable governed by an SDE they prove that the blue-damping-term is able to compensate for the green-noise-term so that the stationary distribution of the dynamics is the negative-exp of the original hamiltonian. Specifically they make use of the following relations:
Let \(z\) be governed by the following SDE \(\displaystyle dz = g(z) dt + \mathcal{N}(0, 2D(z)dt)\). Let \(p_t\) be the distribution of \(z\) at time \(t\). Then
\(d p_t(z) / dt = - \nabla [g(z) p_t(z)] + \nabla [D(z) \nabla p_t(z)]\)
- Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget, Korattikara, Chen, and Welling (ICML 2014)