## STAT 295 4/7/09

Remarks on the homework: it’s not meant to be time-consuming on the coding side; adapt last week’s code and Jeff’s. It is more an exercise in thinking.

Create an output object from BUGS with hw9.sim=bugs(…). Then attributes(hw9.sim) will tell you what’s in it; names(hw9.sim) gives similar information. The function objects() will tell you what’s in your workspace.

We’ve just touched on the posterior predictive distribution (PPD) and its uses.

Bill asked whether there is a principled method for choosing a posterior p-value. In our example we were struck by the large residual and ginned up one of our discrepancy functions to test it. Would we have done that if we hadn’t been struck by it? What if we try multiple discrepancy functions? Should we adjust the posterior p-values with something like a Bonferroni correction to account for that?

Jeff asked, what’s wrong with looking at that large residual, since we condition on the data anyway. But he agreed that large data sets will give more opportunity for some discrepancy function to display an anomaly. He noted that there is a huge menu of frequentist tests (e.g., for normality). One such test might be consistent with the data set, and another might reject it. Obviously, this is an area that should be (and is) a subject of active research.

[Added later]: I guess that the thing that bothered me about the particular example is that in Bayesian theory, you are supposed to ask your questions before looking at the data (i.e., what’s the likelihood? What are the priors? Now look at the data. Multiply, calculate, get posterior stuff). Looking at the data and seeing a large outlier and then asking the question seems to violate that principle. I don’t know the answer to this.

Today Jeff talked about hierarchical models. These extend the two-stage (likelihood + prior) structure to a hierarchy of L levels. The joint distribution of the data and parameters is $f(\vec{y},\theta_1,\theta_2,...,\theta_L \mid \eta) = f(\vec{y} \mid \theta_1)g_1(\theta_1 \mid \theta_2)g_2(\theta_2 \mid \theta_3)...g_L(\theta_L \mid \eta)$

The notation suppresses the dependence on the $\theta\mbox{'s}$ lower down in the hierarchy, i.e., $f(\vec{y} \mid \theta_1)$ means $f(\vec{y} \mid \theta_1,...,\theta_L,\eta)$ and $g_1(\theta_1 \mid \theta_2)$ means $g_1(\theta_1 \mid \theta_2,...,\theta_L,\eta)$, etc.

If we have $f(y \mid \theta_1)g_1(\theta_1\mid\theta_2)$, we could specify a value for $\theta_2$ OR we could put a prior on it, which lessens the sensitivity of the results to any particular choice of $\theta_2$.

Example 1: We already looked at the model $\begin{array}{rclr}y_i\mid\lambda&\sim&\mbox{Pois}(\lambda)\\ \lambda\mid\beta&\sim&\mbox{gamma}(2,\beta)\\ \beta&\sim&\mbox{gamma}(a,b)\end{array}$

with a and b specified by the statistician.
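This hierarchy can be simulated forward to see what data it generates. Here is a minimal Python sketch (the hyperparameter values a=3, b=1 are illustrative assumptions, not values from the notes; numpy parameterizes the gamma by shape and scale, so a rate becomes scale = 1/rate):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hyperparameters: illustrative choices, not from the notes.
# (a = 3 keeps E[1/beta] = b/(a-1) finite, so E[y] = 2*E[1/beta] = 1 here.)
a, b = 3.0, 1.0
n = 10_000  # number of forward draws

# Top level: beta ~ gamma(a, b); rate b becomes scale 1/b for numpy.
beta = rng.gamma(shape=a, scale=1.0 / b, size=n)

# Middle level: lambda | beta ~ gamma(2, beta)  (shape 2, rate beta)
lam = rng.gamma(shape=2.0, scale=1.0 / beta)

# Bottom level: y | lambda ~ Poisson(lambda)
y = rng.poisson(lam)
```

Drawing from the top down like this is exactly how the joint distribution factorizes.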

Hierarchical models can arise naturally from the kind of data you have. If you are comparing hospitals, for example, the data may be multi-level simply because they are structured that way.

Counties within states, schools within counties, teachers within schools, and students within a teacher’s class might likewise induce a hierarchical or multi-level structure.

Example 2:

Suppose we want to model the performance of trauma centers. We collect data on patients at n=40 different trauma centers.

Let $y_{ij}=\left\{\begin{array}{l}0\mbox{ live}\\1\mbox{ die}\end{array}\right.$

for jth patient at ith trauma center. So $\begin{array}{l}i=1,...,n\\ j=1,...,n_i\end{array}$

where n=40 (the number of trauma centers). Note that the number of patients per hospital, $n_i$, varies from hospital to hospital. This means, in particular, that the estimates for hospitals with large patient loads will stay close to each hospital’s own mean, while those for hospitals with small patient loads will tend to regress (shrink) more toward the mean over all hospitals.

Patient level model: $y_{ij}\mid p_{ij} \sim \mbox{bin}(1, p_{ij})$, where $\mbox{logit}(p_{ij})=\alpha_i+\beta^T x_{ij}$

Here, $x_{ij}$ is a vector of covariates (e.g., indicator variables like {male, female}, discrete variables such as age, continuous variables such as burn area) indicating the patient’s particular situation. $\alpha_i$ will represent the overall outcome for the individual hospital.

Note that we have $p_{ij}=\frac{1}{1+\exp(-(\alpha_i+\beta^T x_{ij}))}$
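As a quick numerical check of this inverse-logit relationship (the particular $\alpha$, $\beta$, and covariate values below are made up for illustration):

```python
import math

def inv_logit(z):
    """Map a linear predictor z = alpha + beta'x to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical values for one patient at one hospital:
alpha_i = -2.0          # hospital effect
beta = [0.5, 0.03]      # coefficients for (male indicator, age)
x_ij = [1.0, 40.0]      # male, age 40

z = alpha_i + sum(b * x for b, x in zip(beta, x_ij))
p_ij = inv_logit(z)     # z = -0.3, p is about 0.426
```

Applying `logit` to `p_ij` recovers the linear predictor `z`, which is what the model statement asserts.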

Now we want to model the $\alpha_i\mbox{'s}$, because they will vary amongst the hospitals.

Hospital level model: $\alpha_i\mid \mu_\alpha, \sigma_\alpha \sim N(\mu_\alpha,\sigma_\alpha)$

The $\alpha\mbox{'s}$ are very interesting as they tell us how the probability of dying varies from hospital to hospital.

When Jeff started in statistics there was no hierarchical method to do this. He had to estimate the $\alpha\mbox{'s}$ directly.

We might even let the $\beta\mbox{'s}$ vary amongst hospitals (i.e., be a function of i). If you did this and there were not much data on a rare presentation such as “gunshot wound to the toe,” those coefficients would shrink nearly to a single common value.

No pooling: the $\alpha\mbox{'s}$ are estimated separately. Complete pooling: all the $\alpha\mbox{'s}$ are equal. The Bayesian hierarchical model for the $\alpha_i$ is a compromise between these two extremes.

Amongst hospitals with lots of patients, the $\alpha\mbox{'s}$ will be different since the data for the individual hospital will dominate, but with relatively few patients, the $\alpha\mbox{'s}$ for an individual hospital will shrink towards the mean.
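This shrinkage can be made concrete with a toy calculation. In the simpler normal-normal setting with known variances (not the logistic model above, but the same mechanism), the posterior mean of a hospital effect is a precision-weighted average of the hospital’s own sample mean and the grand mean; a Python sketch with hypothetical numbers:

```python
def partial_pool(ybar_i, n_i, mu_alpha, sigma2, sigma2_alpha):
    """Posterior mean of a hospital effect in a normal-normal model with
    known data variance sigma2 and known between-hospital variance
    sigma2_alpha (the standard shrinkage formula for that toy model)."""
    w = n_i / (n_i + sigma2 / sigma2_alpha)  # weight on the hospital's own data
    return w * ybar_i + (1.0 - w) * mu_alpha

mu_alpha = 0.0                 # overall mean (hypothetical)
sigma2, sigma2_alpha = 1.0, 1.0

# Two hospitals with the same raw mean but very different patient loads:
big = partial_pool(ybar_i=2.0, n_i=500, mu_alpha=mu_alpha,
                   sigma2=sigma2, sigma2_alpha=sigma2_alpha)
small = partial_pool(ybar_i=2.0, n_i=2, mu_alpha=mu_alpha,
                     sigma2=sigma2, sigma2_alpha=sigma2_alpha)
# big stays near 2.0; small is pulled much closer to mu_alpha = 0.
```

The weight on a hospital’s own data grows with $n_i$, which is exactly the large-load versus small-load behavior described above.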

Priors: $\begin{array}{rclr}\beta & \sim & \mbox{flat}\\ \mu_\alpha & \sim &\mbox{flat}\\ \sigma_\alpha &\sim &\mbox{unif}(0,100)\end{array}$

for example.

Next: Heart Transplant Mortality Data

The book has a chapter on hierarchical models in which this is analyzed, but it’s confusing. Jeff said that Bill’s notes have a better discussion, so he will do his own example.

Example: Heart Transplant Mortality Data

We have data on the number of observed deaths for n=94 hospitals. Also, for the ith hospital we have a value $d_i$ that represents the expected number of deaths for that hospital given that hospital’s case mix (e.g., age, smoking status, …):

$y_i=$ observed # of deaths, $i=1,\dots,94$

$d_i=$ expected # of deaths based on case mix

Model: $\begin{array}{rclr}y_i\mid\lambda_i,\beta&\sim&\mbox{Pois}(\lambda_i d_i)\\ \lambda_i\mid\beta&\sim&\mbox{gamma}(2,\beta)\\ \beta&\sim&\mbox{gamma}(1,1)\end{array}$

This differs from the homework in that we have a $\lambda_i$ for each hospital, not just one $\lambda$ for all hospitals.

The posterior is proportional to $g(\lambda_1,...,\lambda_n,\beta\mid\vec{y}) \propto \prod_1^n{[f(y_i\mid \lambda_i,\beta)\,g(\lambda_i\mid\beta)]}\,g(\beta)$ $\propto \prod_1^n{[(d_i \lambda_i)^{y_i} \exp(-d_i \lambda_i) \beta^2 \lambda_i \exp(-\lambda_i \beta)]}\exp(-\beta)$ $\propto \prod_1^n{[\lambda_i^{y_i+1}\exp(-\lambda_i(d_i+\beta))]}\beta^{188}\exp(-\beta)$ (the factor $\beta^{188}$ is $\beta^{2n}$ with n=94).
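To see where the full conditionals come from, collect from the joint posterior only the factors that involve $\lambda_1$ (everything else is a constant): $g(\lambda_1\mid\lambda_2,...,\lambda_n,\beta,\vec{y}) \propto \lambda_1^{y_1+1}\exp(-\lambda_1(d_1+\beta))$, which is the kernel of a gamma density with shape $y_1+2$ and rate $d_1+\beta$.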

What is full conditional posterior on $\lambda_1$? It is $\lambda_1\mid\lambda_2,...\lambda_n,\beta \sim \mbox{gamma}(y_1+2, \beta+d_1)$

or $\lambda_1\mid\lambda_{-1},\beta \sim \mbox{gamma}(y_1+2, \beta+d_1)$

and similarly for the other hospitals. Also, $\beta\mid\lambda_1,...,\lambda_n,\vec{y} \sim \mbox{gamma}(189,1+\sum_1^{94}{\lambda_i})$, where the shape is $2n+1=189$.

Comment: In R, you can sample all the $\lambda\mbox{'s}$ in one statement, because $\lambda_i\mid\lambda_{-i},\beta$ has the same distribution as $\lambda_i\mid\beta$; i.e., the $\lambda\mbox{'s}$ are conditionally independent given $\beta$. So the alternating sampling implied by Gibbs sampling can be replaced by a single vector draw.
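A hand-rolled Gibbs sampler for this model is short enough to write out. Here is a Python/numpy sketch (the actual heart-transplant data aren’t reproduced in the notes, so the data below are simulated stand-ins; numpy’s gamma takes shape and scale = 1/rate, and the $\lambda$ update is the single vector draw just described):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data: 94 hospitals with expected deaths d_i and
# observed deaths y_i (the real data set is not reproduced in the notes).
n = 94
d = rng.uniform(0.5, 20.0, size=n)              # expected deaths per hospital
true_lam = rng.gamma(shape=2.0, scale=0.5, size=n)
y = rng.poisson(true_lam * d)                   # observed deaths

n_iter = 5000
lam = np.ones(n)                                # initial values
beta = 1.0
lam_draws = np.empty((n_iter, n))
beta_draws = np.empty(n_iter)

for t in range(n_iter):
    # All lambda_i at once: lambda_i | beta, y ~ gamma(y_i + 2, rate d_i + beta)
    lam = rng.gamma(shape=y + 2.0, scale=1.0 / (d + beta))
    # beta | lambda, y ~ gamma(2n + 1, rate 1 + sum(lambda_i))
    beta = rng.gamma(shape=2.0 * n + 1.0, scale=1.0 / (1.0 + lam.sum()))
    lam_draws[t] = lam
    beta_draws[t] = beta
```

After discarding some burn-in, the rows of `lam_draws` and the entries of `beta_draws` are draws from the joint posterior.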

WinBUGS does all this for us of course. But sometimes you have to “roll your own.”

### 2 Responses to “STAT 295 4/7/09”

1. Jeff Says:

bill and everyone in class,

regarding the questions about model assessment bill is raising–i think the paper by r little in the american statistician, 2006 number 3, gives a good discussion of how a bayesian might assess his/her models.

if you have a look, let me know what you think.

jeff

2. bayesrules Says:

This is an interesting article, worth reading.

Bill