Stat 295, 4/9/09

We left off last time with a model for heart transplant mortality data over 94 hospitals.

$y_i$ = observed deaths at hospital $i$
$d_i$ = expected deaths given its case mix
$y_i \mid \lambda_i,\beta \sim \mbox{Pois}(d_i \lambda_i)$

Want to estimate all the $\lambda\mbox{'s}$.

$\lambda_i \mid \beta \sim \mbox{gamma}(2,\beta)$
$\beta \sim \mbox{gamma}(1,1)$

The BUGS code and data were already posted, but Jeff has modified the code here. We ran the code and plotted the MLEs against the posterior means; they would be identical if they fell along the line $y=x$. We saw shrinkage: even though some hospitals had zero deaths, their shrunken estimators were pulled toward 1. The same thing happened for hospitals where $\lambda<1$ (their shrunken estimates were bigger) and $\lambda>1$ (their shrunken estimates were smaller).
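The shrinkage has a closed form if we condition on the hyperparameter $\beta$ (in the full model $\beta$ gets a gamma(1,1) prior and is sampled by MCMC; fixing $\beta=2$ below is purely illustrative, chosen so the prior mean $2/\beta$ equals 1, and the $y$, $d$ values are made up). A minimal sketch:

```python
# With y_i | lambda_i ~ Pois(d_i * lambda_i) and lambda_i ~ gamma(2, beta)
# (shape 2, rate beta), conjugacy gives
#   lambda_i | y_i, beta ~ gamma(2 + y_i, beta + d_i),
# so the posterior mean is (2 + y_i) / (beta + d_i).

def mle(y, d):
    """Raw standardized mortality ratio (the unshrunken estimate)."""
    return y / d

def post_mean(y, d, beta=2.0):
    """Posterior mean of lambda_i for a FIXED beta (illustrative only;
    beta = 2 makes the prior mean 2/beta = 1)."""
    return (2 + y) / (beta + d)

# A zero-death hospital is pulled up from 0 toward the prior mean 1:
print(mle(0, 5), post_mean(0, 5))    # 0.0 vs ~0.286
# A hospital with MLE 2 is pulled down toward 1:
print(mle(10, 5), post_mean(10, 5))  # 2.0 vs ~1.714
```

Note the direction of the pull in both cases matches the plot: estimates above 1 shrink down, estimates below 1 shrink up.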

For hospital 9, although 1 is included in the credible interval, we can calculate the probability that its death rate is high by computing (in R): mean(lambda[,9]>1) = 0.84, so the probability that $\lambda_9>1$ is pretty high.
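The R one-liner has a direct Python analogue once posterior draws are in hand. The shape and rate below are made-up values just to show the mechanics; in practice the draws come out of the MCMC output:

```python
import random

# Monte Carlo version of mean(lambda[,9] > 1): suppose, hypothetically,
# lambda_9 | data ~ gamma(shape=3, rate=2). Note random.gammavariate takes
# (shape, SCALE), so the scale is 1/rate.
random.seed(1)
shape, rate = 3, 2.0
draws = [random.gammavariate(shape, 1.0 / rate) for _ in range(100_000)]
prob_high = sum(lam > 1 for lam in draws) / len(draws)
print(prob_high)  # Monte Carlo estimate of Pr(lambda_9 > 1)
```

The estimate is just the fraction of posterior draws exceeding 1, exactly what the R `mean` of a logical vector computes.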

A similar calculation (this time applied to the difference between two $\lambda$'s) can compute the probability that hospital $i$ is worse than the best hospital (say #1): $\Pr(\lambda_i>\lambda_1)$, say.

Mike asked: What is “significant”? Answer, it’s not really appropriate to think of this in terms of “statistical significance”, which is a frequentist notion. A Bayesian would more likely ask, what’s the decision-theoretic way to look at this? An insurance company and a hospital director might have different loss functions and therefore might take different actions upon learning this information.

In hierarchical models, we have to be careful about using improper priors, because they can produce improper posteriors.

Classic example:

$y_i \sim N(\mu + \alpha_i, \sigma^2)$
$\alpha_i \sim N(0, \sigma_\alpha^2)$
$\frac{1}{\sigma_\alpha^2} \sim \mbox{gamma}(\epsilon, \epsilon)$

As $\epsilon\rightarrow 0$, the posterior becomes improper.

There’s a discussion of this in Gull’s paper “Why I Am A Bayesian.” Look up “sermon on the spike” near the end.

Priors: Choosing a prior, or having to choose a prior, is perhaps the bane of being Bayesian. Or maybe the advantage of being Bayesian. (Uninformative might be the bane, said Jeff).

We can organize the discussion into three main themes: Informative, Noninformative, Conjugate.

1. Informative:

a) You have prior data and can use this to determine a prior. This might come from previous studies, e.g., if you analyze hospitals year to year.

b) Subjective or elicited priors.

2. “Noninformative”: (the holy grail, always out of reach).

a) Mathematical principles for choosing noninformative priors.

i) Jeffreys’ rule
ii) Maximum entropy
iii) Invariance

b) Reference priors: the idea is to find a prior that is “automatic” for a given model and convenient to use as a standard for a particular model.

3. Conjugate priors:

Mathematical convenience is the raison d'être. The prior and posterior are in the same family. Most if not all standard conjugate-prior examples arise within the exponential family of distributions.

We talked about noninformative priors. Jeff motivated the discussion with an example showing the difficulties.

Example:

$y \sim \mbox{bin}(n,\theta)$ and use as prior $g(\theta)=I(0<\theta<1)$

Suppose another analyst uses a different parameterization: instead of $\theta$, he uses the log odds $\gamma=\log\left(\frac{\theta}{1-\theta}\right)$. In this case $-\infty<\gamma<\infty$, and $\theta=F(\gamma)$ where $F(\gamma)=(1+\exp(-\gamma))^{-1}$.

The prior in terms of $\gamma$, using the change of variables formula (involving the Jacobian) is

$g_1(\gamma)=\left| \frac{dF(\gamma)}{d\gamma}\right| g(F(\gamma))$

where the derivative is $F(\gamma)(1-F(\gamma))$.

The result is $g_1(\gamma)=F(\gamma)(1-F(\gamma))$, a bell-shaped curve with most of its mass between $-4$ and $4$. In particular, it is not flat: the prior is no longer uniform under the new parameterization.
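We can check this numerically: with a uniform prior on $\theta$, the induced density on the log odds should be $F(\gamma)(1-F(\gamma))$, a proper bell-shaped density peaking at $\gamma=0$. A quick sketch:

```python
import math

# Change-of-variables check: the induced prior on the log odds is
#   g1(gamma) = F(gamma) * (1 - F(gamma)),  F(gamma) = 1 / (1 + exp(-gamma)),
# which should integrate to 1 and be maximized at gamma = 0.

def g1(gam):
    F = 1.0 / (1.0 + math.exp(-gam))
    return F * (1.0 - F)

# Riemann sum on [-20, 20]; the tails beyond that are negligible.
h = 0.01
total = sum(g1(-20 + h * i) for i in range(4001)) * h
print(total)    # ~ 1.0: a proper density
print(g1(0.0))  # peak value 0.25 at gamma = 0
```

The peak value $g_1(0)=F(0)(1-F(0))=\tfrac12\cdot\tfrac12=0.25$, and nearly all the mass sits in $(-4,4)$, consistent with the plot described above.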

The dilemma can be described as follows:

1. Two analysts, using the same probability model for the data but different parameterizations, and each using a uniform prior under their parameterization, will arrive at different posterior inferences. Thus, you can’t say “always use a uniform prior.”

2. On the other hand, if one analyst uses a uniform prior on $\theta$ and the other carries that prior over to the transformed parameter $\gamma=h(\theta)$, using $g_1(\gamma)=\left| \frac{d h^{-1}(\gamma)}{d \gamma}\right| g(h^{-1}(\gamma))$, they get consistent posterior inferences, but $g_1(\gamma)$ is no longer uniform.

2(a)(i) Jeffreys' Rule. Harold Jeffreys gave a rule that specifies how to determine a prior once the probability model for the data, $f(y\mid\theta)$, has been specified. The rule has the property that two statisticians, each following Jeffreys' rule and using the same probability model but different parameterizations, will get equivalent posterior distributions. In this sense, Jeffreys' rule is invariant to reparameterization.

Jeffreys’ Rule:

We’ll consider the 1-dimensional case. Let $f(y\mid\theta)$ be the probability model for the data, and define the expected Fisher Information as follows:

$I(\theta)=E\left[\left(\frac{d \log f(y\mid\theta)}{d\theta}\right)^2\right] = -E\left[\frac{d^2 \log f(y\mid\theta)}{d \theta^2}\right]$

Jeffreys’ rule says to choose the prior density $g_J(\theta)$ so that $g_J(\theta) \propto \sqrt{I(\theta)}$

In the binomial case

$f(y \mid \theta)=C^n_y\theta^y(1-\theta)^{n-y}$

$\log f(y\mid\theta)=\log C^n_y + y \log( \theta) + (n-y)\log(1-\theta)$

Since $E[y]=n\theta$, a routine calculation yields

$I(\theta)=\frac{n}{\theta(1-\theta)}$,

$g_J(\theta)\propto(\theta(1-\theta))^{-1/2}$, which is the Beta$(1/2,1/2)$ density.

Note: $C^n_y=\left(\begin{array}{c}n\\y\end{array}\right)$
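The "routine calculation" can be checked numerically by taking the expectation of the squared score over $y$ (the values $n=10$, $\theta=0.3$ below are just illustrative):

```python
import math

# For y ~ bin(n, theta), the score is
#   d/dtheta log f(y|theta) = y/theta - (n - y)/(1 - theta),
# and E[score^2] should equal the Fisher information n / (theta * (1 - theta)).

def fisher_info(n, theta):
    total = 0.0
    for y in range(n + 1):
        f = math.comb(n, y) * theta**y * (1 - theta)**(n - y)  # binomial pmf
        score = y / theta - (n - y) / (1 - theta)
        total += score**2 * f
    return total

n, theta = 10, 0.3
print(fisher_info(n, theta))      # expectation of the squared score
print(n / (theta * (1 - theta)))  # closed form: matches the line above
```

Taking the square root then gives the Jeffreys prior shape $(\theta(1-\theta))^{-1/2}$ up to the constant factor $\sqrt{n}$.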