## Stat 295, 3/19/09

We put up a chart with sufficient conditions for a Markov chain to have as its limiting distribution the target distribution we are interested in. Briefly, the chain (1) must be irreducible: able to reach every state from every other state in a finite number of steps, with nonzero probability; (2) must be aperiodic: it must not contain periodic states, which can only recur at multiples of some fixed number of steps; (3) must be recurrent, not transient: the chain must keep returning to every region of the state space rather than wandering off forever.

We showed that if the proposal distribution is $q(j \mid i)$ for any two states $i, j$, then we satisfy the detailed balance condition with respect to the target distribution $\phi$ by choosing the transition probability $M_j^i = q(j \mid i)\,\alpha_j^i$, where

$\alpha_j^i=\min\left\{1,\frac{q(i \mid j)\,\phi_j}{q(j \mid i)\,\phi_i}\right\}$

The procedure for drawing a sample from the posterior distribution becomes:

1. Start at an arbitrary point $i$ in the state space $S$
2. Generate a random variable $j$ from the arbitrary but fixed proposal distribution $q(j \mid i)$. This is a proposed move to a new state.
3. Calculate $\alpha_j^i$ according to the above prescription.
4. Generate a $U(0,1)$ random variable $u$.
5. If $u<\alpha_j^i$, accept $j$ as the new state (the chain moves); otherwise, keep $i$ as the new state (the chain doesn’t move).

Repeat steps 2–5 until a sufficiently large sample is obtained.
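The steps above can be sketched in a few lines of code. This is a minimal sketch in Python (the course uses R), assuming a standard normal target and a symmetric Gaussian random-walk proposal, so the proposal ratio $q(i \mid j)/q(j \mid i)$ cancels and $\alpha_j^i$ reduces to the target ratio:

```python
import math
import random

def log_target(x):
    # Log of an (unnormalized) standard normal target density.
    return -0.5 * x * x

def metropolis_hastings(n, x0=0.0, width=2.5, seed=1):
    rng = random.Random(seed)
    x = x0
    chain = [x]                      # step 1: arbitrary starting state
    for _ in range(n - 1):
        # Step 2: propose from a normal centered at the current state.
        # The proposal is symmetric, so q(i|j)/q(j|i) = 1.
        j = rng.gauss(x, width)
        # Step 3: log of alpha (taking the min with 0 is unnecessary;
        # see the remark below about omitting the minimum).
        log_alpha = log_target(j) - log_target(x)
        # Steps 4-5: accept with probability min(1, exp(log_alpha)).
        if math.log(rng.random()) < log_alpha:
            x = j
        # Keep the current state either way: repeats stay in the sample.
        chain.append(x)
    return chain

chain = metropolis_hastings(20000)
mean = sum(chain) / len(chain)       # should be near 0 for this target
```

Note that the chain records a value on every iteration, whether or not the proposal is accepted, which is exactly the point about repeats made below.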

We remarked that it is very important not to discard $i$ if the chain doesn’t move. It is still part of the sample, and if you were to discard it, you would not be sampling from the true posterior distribution. It is possible, and indeed happens frequently, that a chain will not move on every try but will remain in the same position. Those “repeats” are still part of the chain and must be kept. Jeff remarked that when he was learning the Metropolis-Hastings method, he wrote a program that discarded these steps, and was puzzled when the results did not agree with what he knew to be the correct answer. It was only when he fixed the program to keep the repeats that the results were as expected.

(Parenthetically, when you are doing SIR, it is also possible, and expected, that some points will be resampled more than once, since you are sampling with replacement. It would be as wrong to discard those repeats as it is to discard repeats when doing Metropolis-Hastings sampling.)

I remarked that when I actually code the calculation, I omit computing the minimum. It’s not necessary: if the ratio is greater than 1, the comparison with $u$ will always accept anyway, since $u \le 1$.

Jeff and I warned that it is essential to calculate using logs, and in particular, not to calculate the likelihood and then take the log, but rather to calculate the log likelihood directly. One trick is that many of the densities in R have a parameter `log` which, if set to `TRUE` (`T`), will directly give you the log of the density. So `y=dnorm(x,mean=a,sd=sigma,log=T)` gives you the log of the normal density at `x`. Obviously, then, you will calculate the log of $\alpha_j^i$ by adding and subtracting the logs of the priors and likelihoods, and compare it to the log of $u$ in deciding whether to accept or reject the proposed move.
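To see why computing the log directly matters, here is a small Python illustration (the course uses R's `dnorm`; the `log_dnorm` helper below is a hand-rolled stand-in). Far out in the tail, the density itself underflows to zero in floating point, so taking its log fails, while the directly computed log density is a perfectly ordinary finite number:

```python
import math

def log_dnorm(x, mean=0.0, sd=1.0):
    # Log of the normal density computed directly,
    # like dnorm(..., log=TRUE) in R.
    return (-0.5 * math.log(2 * math.pi) - math.log(sd)
            - 0.5 * ((x - mean) / sd) ** 2)

x = 40.0
naive = math.exp(log_dnorm(x))   # the density underflows to 0.0 out here
direct = log_dnorm(x)            # finite: about -800.9

# math.log(naive) would raise a domain error on 0.0, whereas the
# direct log density drops straight into the log of alpha.
```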

We ran the simple program in the notes. We found that with this kind of symmetric proposal (that is, the proposal density is an even function centered on the current position), the width of the proposal matters. If the proposed steps are too large, the sampler proposes many points in the tails of the target distribution; these will probably be rejected, since the $\alpha_j^i$ will be very small, and the sampler will remain at the same point for long periods before it eventually proposes a step that is accepted. Conversely, if the proposed steps are all small because the width of the proposal distribution is too small, then most steps will be accepted, but the sampler will take a long time to get anywhere. Typically, you want the proposal distribution for this kind of sampler to have a width that is comparable to several standard deviations of the posterior distribution.
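The trade-off is easy to see numerically by tracking the acceptance rate. This Python sketch (again assuming a standard normal target, a toy stand-in for the program in the notes) runs a random-walk sampler at three proposal widths:

```python
import math
import random

def acceptance_rate(width, n=5000, seed=2):
    # Random-walk Metropolis on a standard normal target; returns
    # the fraction of proposals accepted for a given proposal width.
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n):
        j = rng.gauss(x, width)
        # log alpha for a symmetric proposal: log_target(j) - log_target(x)
        if math.log(rng.random()) < 0.5 * (x * x - j * j):
            x = j
            accepted += 1
    return accepted / n

# Tiny steps: almost everything is accepted, but the chain crawls.
# Huge steps: most proposals land in the tails and are rejected.
rates = {w: acceptance_rate(w) for w in (0.1, 2.5, 50.0)}
```

The middle width, a few target standard deviations, accepts a healthy fraction of proposals while still moving around the distribution.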

We found that if we started far from the peak of the posterior distribution, it may take quite a while before the sampler moves into a region that mixes well. This is why we discard the beginning of the sample, the “burn-in.” We also pointed out that a good technique is to run several different chains, starting at different points, to see if the inferences we get are similar. This is a good sanity check.

Finally, we remarked that the larger the sample, the better the inferences from the sample will be. We noted that we are actually using frequentist ideas when we calculate means, medians, standard deviations, quantiles, etc., from the sample. This may be regarded by some as ironic. We also noted that the precision of the inferences improves with sample size $N$ only proportionally to $\sqrt{N}$, so there is a point where taking more and more samples doesn’t improve things much. The standard deviation of our estimate of the posterior mean, for example, only improves by a factor of about 3 when we take 10 times as many samples, and by a factor of 10 if we take 100 times as many. I also pointed out that taking more and more points doesn’t change things like the standard deviation of the posterior distribution itself; it only improves our estimate of that standard deviation.
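The $\sqrt{N}$ scaling is easy to check with independent draws (a best case: MCMC draws are correlated, so the effective sample size is smaller than $N$). This Python sketch estimates the standard error of the sample mean at two sample sizes differing by a factor of 100:

```python
import random
import statistics

rng = random.Random(3)

def se_of_mean(n, reps=200):
    # Spread of repeated estimates of the mean of N(0,1) from samples
    # of size n: an empirical Monte Carlo standard error.
    means = [statistics.fmean(rng.gauss(0, 1) for _ in range(n))
             for _ in range(reps)]
    return statistics.stdev(means)

se_small = se_of_mean(100)
se_large = se_of_mean(10000)
ratio = se_small / se_large   # roughly sqrt(10000 / 100) = 10
```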