## Stat 295, April 16, 2009

The lecture today continued on the topic of hypothesis testing, and Bill discussed some issues with frequentist hypothesis testing. Analysts disagree about how hypothesis testing should be carried out, and there is also some confusion about the interpretation of p-values. The issue is complex (else there wouldn’t be so much disagreement).

First, in most applications it is better to compute and report confidence or credible intervals and avoid hypothesis testing altogether. In other applications, you need to decide whether it makes sense to construct a decision rule, or whether it makes more sense simply to present the evidence against the null for the observed data. If constructing a decision rule makes sense, then choose an alpha level, construct a rejection region, and reject when the test statistic falls in the rejection region. Quality control applications often use this methodology appropriately: stop the process if it is ‘out of control’.
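As a minimal sketch of the decision-rule approach, assume a two-sided z-test with a known standard deviation. The function name `z_test_decision` and the quality-control numbers below are hypothetical, chosen only for illustration; they are not from the lecture.

```python
from math import sqrt
from statistics import NormalDist


def z_test_decision(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided z-test decision rule (sigma assumed known).

    Returns the test statistic, the critical value that defines the
    rejection region |z| > z_crit, and the reject/do-not-reject decision.
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_crit, abs(z) > z_crit


# Hypothetical process: target mean 10.0, known sigma 0.5, 25 items sampled.
z, z_crit, stop_process = z_test_decision(
    sample_mean=10.3, mu0=10.0, sigma=0.5, n=25
)
print(f"z = {z:.2f}, critical value = {z_crit:.2f}, stop process: {stop_process}")
```

The alpha level is fixed before the data are seen, and the output of the rule is a decision, not a measure of evidence.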

On the other hand, if you don’t want to make a decision between an alternative and a null, but simply want to quantify the evidence in the data against a null hypothesis (the alternative doesn’t even come into play), then compute a p-value. P-values are clearly defined to present evidence against the null. But don’t confuse p-values with type-I error rates.
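For comparison, here is a small sketch of the p-value computation, again assuming a two-sided z-test. The helper name `two_sided_p_value` is hypothetical. Note that only the null distribution of the statistic enters the calculation; no alternative is specified anywhere.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """P(|Z| >= |z|) when Z is standard normal, i.e. when the null is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Evidence against the null for a few observed z statistics.
for z_obs in (1.0, 1.96, 3.0):
    print(f"z = {z_obs:.2f}  ->  p = {two_sided_p_value(z_obs):.4f}")
```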

Regarding interpretation of p-values, page 30 of the notes says:

> Then amongst those experiments rejected with p-values in $(0.05 - \epsilon, 0.05)$ for small $\epsilon$, at least 30% will actually turn out to be true, and the true proportion can be much higher (depends upon the distribution of the actual parameter for the experiments where the null is false).

This says that under these circumstances, the Type I error rate (probability of rejecting a true null), conditioned on our having observed p=0.05, is at least 30%!

I think one needs to think carefully about this claim. The type I error rate is the probability of rejecting the null given that the null is true. The 30% referred to above is not a type I error rate, because it also involves p-values generated under alternatives: it is the proportion of true nulls among experiments whose p-value lands near 0.05, not the probability of rejecting when the null is true. So I don’t think it is fair to compare the 30% to a type I error rate. They are not measuring the same quantity. Is it surprising that a large proportion of p-values in the indicated range would come from the null? I don’t think so, especially if the alternatives tend to be quite far from the null.
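To make the distinction concrete, here is a rough sketch under assumptions that are mine and not necessarily those behind the notes: a two-sided z-test, half of all experiments having a true null, and, where the null is false, a standardized effect drawn from a normal distribution with mean 0 and standard deviation `tau`. Under those assumptions the z statistic is $N(0, 1)$ under the null and $N(0, 1 + \tau^2)$ marginally under the alternatives, so the fraction of true nulls among p-values in a vanishingly small window just below $p$ follows from the ratio of the two densities at the corresponding $|z|$. The function name `null_fraction_near_p` is hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist


def null_fraction_near_p(p, tau, prior_null=0.5):
    """Fraction of true nulls among experiments with p-value just below p.

    Assumes (illustration only): two-sided z-test, a fraction `prior_null`
    of experiments with a true null, and standardized effects ~ N(0, tau^2)
    in the experiments where the null is false.
    """
    z = NormalDist().inv_cdf(1 - p / 2)   # |z| that gives this p-value
    var_alt = 1 + tau ** 2                # marginal variance of z under alternatives
    # Density of z under the null divided by its density under the alternatives.
    ratio = sqrt(var_alt) * exp(-0.5 * z ** 2 * tau ** 2 / var_alt)
    return prior_null * ratio / (prior_null * ratio + (1 - prior_null))


alpha = 0.05  # type I error rate: P(reject | null true), fixed by design
for tau in (2.0, 5.0):
    frac = null_fraction_near_p(0.05, tau)
    print(f"tau = {tau}: fraction of true nulls near p = 0.05 is {frac:.2f}")
```

With these inputs the fraction comes out at roughly 0.32 for `tau = 2` and roughly 0.45 for `tau = 5`, while the type I error rate stays at 0.05 throughout. The fraction grows as the alternatives spread further from the null, because p-values near 0.05 then become relatively rare under the alternatives. The two numbers answer different questions, which is the point of the paragraph above.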