## Stat 295 2/26/09

Comment: Last time we said that the rejection rate is 1/c (for the smallest c). In the problem c=1 is optimal for n=y=5, but it’s not optimal for other data choices. Remarked again the optimal proposal distribution is the target distribution. For this problem y|p~bin(n,p) and $\theta=log(p/(1-p))$ so that $p=1/(1+exp(-\theta))$.

For prior N(0,.25) and y=n=5, $\hat{p}=1$. If we choose y=n=50, the acceptance rate will go way down; the evidence is strong that the coin is not fair, and the prior is not representative of the actual state…likelihood way far away of the proposal. So while it works in theory, in practice it’s not efficient and need to think more carefully.

One more comment about SIR. It’s easier to say than “weighted resampling.”

Result: Sample from posterior allows us to examine sensitivity to our assumptions. For example, sensitivity to the prior. If I want to change the prior to another, we can use SIR sampling using as weights the ratio of priors.

Have sample $\{\theta_j^*\}_{j=1}^N$, can obtain sample from posteriors that result from using another prior or if we left some data out. Procedure: if $\{\theta_j^*\}_{j=1}^N$ is a sample from $g_1(\theta|y) \propto f(y|\theta)g_1(\theta)$ and we wish to obtain a sample $\{\tilde{\theta_j}^*\}_{j=1}^N$ from $g_2(\theta|y) \propto f(y|\theta)g_2(\theta)$ we can resample the original sample using weights $w_j=\frac{g_2(\theta_j)}{g_1(\theta_j)}$ and probabilities $p_j=\frac{w_j}{\sum{w_j}}$

Leaving data out: Same idea, but now use ratio of the two likelihoods for the weights.

Onward. Talked about how if dimension of $\theta$ is large, these methods won’t work because it will be impossible to get an apropriate proposal density. This leads to MCMC (Markov chain Monte Carlo). [Note, that although the acronym for this is MCMC, all caps, the second word is not capitalized. This is because ‘Markov’ and ‘Monte Carlo’ are proper names, whereas ‘chain’ is just a word.] We discussed some of the history of the methods, which were developed during the bomb projects in the 1950s.

Nicholas Metropolis invented a key idea in this regard, although it wasn’t developed in image processing until the 1980s, and only in the late 1980s was it directed to statistical inference, in a seminal paper by Gelfand and Smith. Note also another paper by these two authors.

Jeff remarked that the name “Gibbs sampling” is very strange. It came from sampling on a distribution invented (I think) by the Yale physicist J. Willard Gibbs, who had nothing to do with how this sampling scheme works (he flourished well before computers, having died in 1903).

Gibbs Sampling:

Outline:

1. Gibbs sampling algorithm
2. Discrete probability example
3. Normal model example
4. (Maybe) Hierarchical model

Gibbs Sampling Algorithm: Let $\theta=(\theta_1,\theta_2,...,\theta_d)$ be the parameter vector of interest where $\theta_i$ is a subvector of $\theta$. It is simplest if the $\theta_i$‘s are all scalars, but actually they can be anything (e.g., vectors, matrices,…)

Suppose we know and can reasonably easily sample from the full conditional densities $p_j(\theta_j|\theta_{-j},y)$, where $\theta_{-j}$ is the vector $\theta$ excluding the jth component.

For example, $\theta_{-1}=(\theta_2,\theta_3,...,\theta_d)$ and $\theta_{-3}=(\theta_1,\theta_2,\theta_4,...,\theta_d)$

1. Let $\theta^0$ be a vector of starting values.
2. Sequentially generate
$\\ \theta_1^{t+1} \sim p(\theta_1|\theta_{-1}^t,y)$,
$\theta_2^{t+1} \sim p(\theta_2|\theta_1^{t+1},\theta_3^t,...,y)$,

$\theta_d^{t+1} \sim p(\theta_d|\theta_1^{t+1},\theta_2^{t+1},...,\theta_{d-1}^{t+1},y)$
3. Iterate step 2 to obtain a sequence of vectors $\theta^1, \theta^2, ...$
4. Discard the first k iterations (burn-in) and use the remainder as a sample from $g(\theta|y)$.

See this paper by Casella and George for a good introduction to the Gibbs sampling algorithm.

Specifying all the full conditionals gives a lot of information that allows us to overcome the curse of dimensionality. Note that replacing $\theta_k^t$ by $\theta_k^{t+1}$ when we sample $\theta_{k+1}^{t+1}$ is essential, otherwise this is not a Markov chain. In a Markov chain, the next state is dependent only on the immediately preceding state. So, if we transition from a state depending on $\theta_k^t$ to a state depending on $\theta_k^{t+1}$, then the next transition must depend on $\theta_k^{t+1}$, otherwise we are no longer dealing with a Markov chain.

Example: Suppose $(\theta_1,\theta_2)$ is discrete, where $\theta_1=1,2$ and $\theta_2=1,2,3$. Suppose the conditionals $f(\theta_1|\theta_2)$ are given by

$\begin{tabular}{||l||c|c|c||}\hline\hline&\begin{math}\theta_2=1\end{math}&\begin{math} \theta_2=2 \end{math}&\begin{math}\theta_2=3\end{math} \\ \hline\hline \begin{math}\theta_1=1\end{math} & 1/3 & 2/3&3/4 \\ \hline \begin{math}\theta_1=2\end{math} & 2/3 & 1/3&1/4 \\ \hline\hline \end{tabular}$

and the conditionals $f(\theta_2|\theta_1)$ by

$\begin{tabular}{||l||c|c||}\hline\hline&\begin{math}\theta_1=1\end{math}&\begin{math} \theta_1=2 \end{math}\\ \hline\hline \begin{math}\theta_2=1\end{math} & 1/6 & 1/2 \\ \hline \begin{math}\theta_2=2\end{math} & 1/3 & 1/4 \\ \hline \begin{math}\theta_2=3\end{math} & 1/2 & 1/4 \\ \hline\hline \end{tabular}$

Note that we can compute the marginal distributions in terms of the conditionals (which we know) and the marginals of the other variable (which we don’t know…yet!) So, for example,

$f_1(\theta_1=1) = \sum_{i=1}^3 f_1(\theta_1=1|\theta_2=i)f_2(\theta_2=i)$.

Also, $f_2(\theta_2=1) = \sum_{j=1}^2 f_2(\theta_2=1|\theta_1=j)f_1(\theta_1=j)$, and similarly for $f_2(\theta_2=2)$ and $f_2(\theta_2=3)$.

Plug the second set of three equations into the first equation to get an expression for $f_1(\theta_1=1)$ in terms of $f_1(\theta_1=1)$ and $f_1(\theta_1=2)$. We also have $f_1(\theta_1=1)+f_1(\theta_1=2)=1$ Let $u_1=f_1(\theta_1=1)$ and $u_2=f_1(\theta_1=2)$ Plug these in; we get two linear equations in two unknowns. The solution is $u_1=.6, u_2=.4$.

From these two we can get the remaining marginals $f_2(\theta_2=1)$, etc. by plugging the $u$‘s into the three other equations above. Once we have the marginals, we need only multiply by the conditionals to recover the joint distribution. So we have shown that one can go from full conditionals to the marginals and then to the full joint distribution. In general, a similar argument works for any discrete distribution. It’s more complicated to prove this in the continuous case. Our result is

$\begin{tabular}{||l||c|c|c||}\hline\hline&\begin{math}\theta_2=1\end{math}&\begin{math} \theta_2=2 \end{math}&\begin{math}\theta_2=3\end{math} \\ \hline\hline \begin{math}\theta_1=1\end{math} & 0.1 & 0.2&0.3 \\ \hline \begin{math}\theta_1=2\end{math} & 0.2 &0.1&0.1 \\ \hline\hline \end{tabular}$

We ran the Gibbs sampler for this in R. Did it for 1000 and 10000 samples, 10000 was better (closer to the exact value)