## Stat 295 2/24/09

I’m doing a lot of this from memory, because a glitch destroyed my first draft of the first half of today’s lecture. If I left out anything important, please post a comment!

Jeff started by recapping and motivating the rejection sampling technique. Recall that the idea is to have an enveloping density p(θ) that is easy to sample from and which has the property that there exists a constant c such that f(y|θ)g(θ)≤cp(θ) for all θ. Then, to generate a sample, we do the following:

1) Draw θ* from the density p(θ) and u~U(0,1)
2) Check if u≤f(y|θ*)g(θ*)/cp(θ*); if it is, set θ=θ* as the sample; if not, discard θ* and return to step 1.
3) Repeat 1-2 until a sufficiently large sample is obtained.

Jeff then tried to motivate the idea that we ought to choose a proposal that is as close in shape to f(y|θ)g(θ) as possible. (See picture below). Anyway, he asked us to consider the proposal distribution as θ~N(0,1), and let that be the prior as well, but with a data set consisting of 100 samples drawn from y~N(5,1) so that the likelihood is peaked sharply near $\bar{y}$. In this case the prior (and proposal) are ill-matched to the posterior distribution. The reason is that we will seldom propose values that are way out in the tails of p(θ) where the likelihood is. Most of the values of θ* that are proposed will be near 0, where p(θ) is large and f(y|θ) is small, so the ratio for the acceptance test will usually be very small, and the uniform u that is generated will almost always be larger than that ratio, so you’ll hardly ever accept θ*. Here, the prior/proposal is the black curve and the likelihood is the red curve. We see that relatively few of the proposed points will lie near where the likelihood, and hence the probability of acceptance, is large. So a large proportion of proposed points will be rejected, and the sampling scheme will be relatively inefficient.

Jeff didn’t mention this, but suppose we had used as a proposal the actual posterior distribution. If this had been the case, then the ratio would have been constant! By picking c so that the ratio is 1 (i.e., choosing c=m(x)) then every single proposed θ* would be accepted! That is perfect sampling.

The next important concept we discussed is Importance Sampling. This is an effective way to calculate expectations. So, for example, we may want to compute E[h(θ|y)] for some function h(), where the expectation is taken over the posterior distribution of θ. So, our aim is to compute $E[h(\theta)]=\frac{\int{h(\theta)f(x|\theta)g(\theta)d\theta}}{\int{f(x|\theta)g(\theta)d\theta}}$

As before, we need a proposal distribution p(θ), but it doesn’t have to be an enveloping distribution (that is, no c need exist that causes the inequality to be satisfied for all θ). However, the support of p(θ) does have to contain the support of the posterior distribution.

The method exploits the fact that if h(θ) is any function, and p(θ) is the density of the random variable θ, then $E[h(\theta)]=\int{h(\theta)p(\theta)d\theta}\approx\frac{1}{M}\sum_{j=1}^M{h(\theta_j)}$

where $\theta_1,\theta_2,...,\theta_M$ is a large sample drawn from p(θ).

The procedure is as follows:

1) Generate a large sample {θj} from p(θ)
2) Compute $w_j=\frac{h(\theta_j)f(y|\theta_j)g(\theta_j)}{p(\theta_j)}$
2) Compute $E[h(\theta)] \approx \frac{\sum_{j=1}^M{h(\theta_j)w_j}} {\sum_{j=1}^M{w_j} }$

Jeff imagined using a uniform distribution for p(θ) (for this to work, the posterior must have compact support, and the uniform must include the support of the posterior but must not extend to infinity…we can’t sample from an infinite uniform!) The problem with doing this is that if the posterior distribution is highly peaked, then most of the samples will be taken where the posterior distribution is small, and in the sum above most of the weights wj will be small.

On the other hand, if we were able to sample from the actual posterior distribution, then all the weights would be the same, and all samples would contribute equally.

We then went over to SIR (Sampling Importance Resampling), which Jeff prefers to call Weighted Resampling. Whatever you call it, it can be an effective way to get an approximate sample from the posterior distribution. Why it is approximate will become evident in due course.

The idea is this:

1) Generate a large sample {θj} from a proposal distribution p(θ).
2) Compute weights at each sample point equal to f(y|θ)g(θ)/p(θ)
3) Divide each weight by the sum of all weights to get a probability at each sample point.
4) Generate the desired sample from the posterior distribution by sampling (with replacement) with probabilities equal to the probabilities computed at each point in step 3. This is done by sampling the vector of indices (1,2,…,N) that correspond to the j’s in the sample generated in step 1.

The w’s are the importance weights. Actually, if you use the R function ‘sample’, you can use the weights directly without bothering to normalize them as in step 3. But Jeff prefers to include step 3 for pedagogical reasons, and to emphasize the fact that we are sampling with a particular set of probabilities.

We noted that a SIR sample is not a sample from the actual posterior distribution (rejection sampling is), since it is actually a sample from the discrete distribution over the points in the original sample. However, if the number of points in the original sample is large, the result can be a very good approximation, and it will approach a sample from the actual posterior distribution as the number of points in the original sample →∞

Remarks:

1. SIR does not need a dominating constant c as in rejection sampling.

2. The resulting sample only approximates the posterior. The approximation improves with larger N and p(θ) is close to f(y|θ)g(θ)/m(x)

3. It’s essential to have enough points in initial sample to cover region of interest for f(y|θ)g(θ).

4. When the dimension of θ is large, it becomes impossible to find an adequate p(θ), and we must resort to MCMC methods.

5. When using R’s ‘sample’ function, it’s sufficient to write “prob=w”; you don’t actually have to divide by the sum of the w’s…but Jeff prefers, for pedagogical and other reasons, to do this extra step.

What happens if p(θ) doesn’t well approximate f(y|θ)g(θ)?

The weights are ratios. They will be small where proposal is big and the posterior is small. They will be large where proposal is small and weights are large; but there will be relatively few samples (because the proposal is small) there and the discrete approximation will be bad (points not dense there). We would need m to be huge for this to work.

On the other hand, if the proposal is a good approximation to the actual posterior, then the weights will be about equal and the sampling will be efficient over the set of sampled points.

Rejection and SIR don’t work well if the number of dimensions is large.

c=1 is the bounding constant for the homework problem. (You will need to prove this when you do the problem).

ALWAYS USE LOGS! The reason is that calculating the posterior instead of the log posterior will often result in machine underflows and/or overflows, where the number computed is either less than the smallest positive number that the machine can express, or larger than the largest that the machine can express, which can (silently, so that you do not know) introduce errors in the result. Thus, in Jeff’s calculation of the weights for SIR, he actually computes the log of the proposal and the product f(y|θ)g(θ), and subtracted the log of the maximum before exponentiating. This prevents these undesirable things from happening.