Stat 295 2/26/09

Comment: Last time we said that the rejection rate is 1/c (for the smallest c). In the problem c=1 is optimal for n=y=5, but it’s not optimal for other data choices. Remarked again the optimal proposal distribution is the target distribution. For this problem y|p~bin(n,p) and \theta=log(p/(1-p)) so that p=1/(1+exp(-\theta)).

For prior N(0,.25) and y=n=5, \hat{p}=1. If we choose y=n=50, the acceptance rate will go way down; the evidence is strong that the coin is not fair, and the prior is not representative of the actual state…likelihood way far away of the proposal. So while it works in theory, in practice it’s not efficient and need to think more carefully.

One more comment about SIR. It’s easier to say than “weighted resampling.”

Result: Sample from posterior allows us to examine sensitivity to our assumptions. For example, sensitivity to the prior. If I want to change the prior to another, we can use SIR sampling using as weights the ratio of priors.

Have sample \{\theta_j^*\}_{j=1}^N, can obtain sample from posteriors that result from using another prior or if we left some data out. Procedure: if \{\theta_j^*\}_{j=1}^N is a sample from g_1(\theta|y) \propto f(y|\theta)g_1(\theta) and we wish to obtain a sample \{\tilde{\theta_j}^*\}_{j=1}^N from g_2(\theta|y) \propto f(y|\theta)g_2(\theta) we can resample the original sample using weights w_j=\frac{g_2(\theta_j)}{g_1(\theta_j)} and probabilities p_j=\frac{w_j}{\sum{w_j}}

Leaving data out: Same idea, but now use ratio of the two likelihoods for the weights.

Onward. Talked about how if dimension of \theta is large, these methods won’t work because it will be impossible to get an apropriate proposal density. This leads to MCMC (Markov chain Monte Carlo). [Note, that although the acronym for this is MCMC, all caps, the second word is not capitalized. This is because ‘Markov’ and ‘Monte Carlo’ are proper names, whereas ‘chain’ is just a word.] We discussed some of the history of the methods, which were developed during the bomb projects in the 1950s.

Nicholas Metropolis invented a key idea in this regard, although it wasn’t developed in image processing until the 1980s, and only in the late 1980s was it directed to statistical inference, in a seminal paper by Gelfand and Smith. Note also another paper by these two authors.

Jeff remarked that the name “Gibbs sampling” is very strange. It came from sampling on a distribution invented (I think) by the Yale physicist J. Willard Gibbs, who had nothing to do with how this sampling scheme works (he flourished well before computers, having died in 1903).

Gibbs Sampling:

Outline:

  1. Gibbs sampling algorithm
  2. Discrete probability example
  3. Normal model example
  4. (Maybe) Hierarchical model

Gibbs Sampling Algorithm: Let \theta=(\theta_1,\theta_2,...,\theta_d) be the parameter vector of interest where \theta_i is a subvector of \theta. It is simplest if the \theta_i‘s are all scalars, but actually they can be anything (e.g., vectors, matrices,…)

Suppose we know and can reasonably easily sample from the full conditional densities p_j(\theta_j|\theta_{-j},y), where \theta_{-j} is the vector \theta excluding the jth component.

For example, \theta_{-1}=(\theta_2,\theta_3,...,\theta_d) and \theta_{-3}=(\theta_1,\theta_2,\theta_4,...,\theta_d)

  1. Let \theta^0 be a vector of starting values.
  2. Sequentially generate
    \\ \theta_1^{t+1} \sim p(\theta_1|\theta_{-1}^t,y),
    \theta_2^{t+1} \sim p(\theta_2|\theta_1^{t+1},\theta_3^t,...,y),

    \theta_d^{t+1} \sim p(\theta_d|\theta_1^{t+1},\theta_2^{t+1},...,\theta_{d-1}^{t+1},y)
  3. Iterate step 2 to obtain a sequence of vectors \theta^1, \theta^2, ...
  4. Discard the first k iterations (burn-in) and use the remainder as a sample from g(\theta|y).

See this paper by Casella and George for a good introduction to the Gibbs sampling algorithm.

Specifying all the full conditionals gives a lot of information that allows us to overcome the curse of dimensionality. Note that replacing \theta_k^t by \theta_k^{t+1} when we sample \theta_{k+1}^{t+1} is essential, otherwise this is not a Markov chain. In a Markov chain, the next state is dependent only on the immediately preceding state. So, if we transition from a state depending on \theta_k^t to a state depending on \theta_k^{t+1}, then the next transition must depend on \theta_k^{t+1}, otherwise we are no longer dealing with a Markov chain.

Example: Suppose (\theta_1,\theta_2) is discrete, where \theta_1=1,2 and \theta_2=1,2,3. Suppose the conditionals f(\theta_1|\theta_2) are given by

\begin{tabular}{||l||c|c|c||}\hline\hline&\begin{math}\theta_2=1\end{math}&\begin{math} \theta_2=2 \end{math}&\begin{math}\theta_2=3\end{math} \\ \hline\hline \begin{math}\theta_1=1\end{math} & 1/3 & 2/3&3/4 \\ \hline \begin{math}\theta_1=2\end{math} & 2/3 & 1/3&1/4 \\ \hline\hline \end{tabular}

and the conditionals f(\theta_2|\theta_1) by

\begin{tabular}{||l||c|c||}\hline\hline&\begin{math}\theta_1=1\end{math}&\begin{math} \theta_1=2 \end{math}\\ \hline\hline \begin{math}\theta_2=1\end{math} & 1/6 & 1/2 \\ \hline \begin{math}\theta_2=2\end{math} & 1/3 & 1/4 \\ \hline \begin{math}\theta_2=3\end{math} & 1/2 & 1/4 \\ \hline\hline \end{tabular}

Note that we can compute the marginal distributions in terms of the conditionals (which we know) and the marginals of the other variable (which we don’t know…yet!) So, for example,

f_1(\theta_1=1) = \sum_{i=1}^3 f_1(\theta_1=1|\theta_2=i)f_2(\theta_2=i).

Also, f_2(\theta_2=1) = \sum_{j=1}^2 f_2(\theta_2=1|\theta_1=j)f_1(\theta_1=j), and similarly for f_2(\theta_2=2) and f_2(\theta_2=3).

Plug the second set of three equations into the first equation to get an expression for f_1(\theta_1=1) in terms of f_1(\theta_1=1) and f_1(\theta_1=2). We also have f_1(\theta_1=1)+f_1(\theta_1=2)=1 Let u_1=f_1(\theta_1=1) and u_2=f_1(\theta_1=2) Plug these in; we get two linear equations in two unknowns. The solution is u_1=.6, u_2=.4.

From these two we can get the remaining marginals f_2(\theta_2=1), etc. by plugging the u‘s into the three other equations above. Once we have the marginals, we need only multiply by the conditionals to recover the joint distribution. So we have shown that one can go from full conditionals to the marginals and then to the full joint distribution. In general, a similar argument works for any discrete distribution. It’s more complicated to prove this in the continuous case. Our result is

\begin{tabular}{||l||c|c|c||}\hline\hline&\begin{math}\theta_2=1\end{math}&\begin{math} \theta_2=2 \end{math}&\begin{math}\theta_2=3\end{math} \\ \hline\hline \begin{math}\theta_1=1\end{math} & 0.1 & 0.2&0.3 \\ \hline \begin{math}\theta_1=2\end{math} & 0.2 &0.1&0.1 \\ \hline\hline \end{tabular}

We ran the Gibbs sampler for this in R. Did it for 1000 and 10000 samples, 10000 was better (closer to the exact value)

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: