Stat 295, April 16, 2009

The lecture today continued on the topic of hypothesis testing, and Bill discussed some issues with frequentist hypothesis testing. There is some disagreement among analysts concerning how hypothesis testing should be carried out and also some confusion concerning interpretation of p-values. The issue is complex (else there wouldn’t be so much disagreement).

First, in most applications it is better to compute and report confidence/credible intervals, and avoid hypothesis testing altogether. In other applications, you need to decide whether it makes sense to construct a decision rule, or alternatively whether it makes sense to simply present the evidence against the null for the observed data. If constructing a decision rule makes sense, then choose an alpha level, construct a rejection region, and reject when the test statistic falls in the rejection region. Quality control applications often appropriately use this methodology: stop the process if it is ‘out of control’.

On the other hand, if you don’t want to make a decision between an alternative and a null, but simply want to quantify the evidence in the data against a null hypothesis (the alternative doesn’t even come into play), then compute a p-value. P-values are defined precisely to present evidence against the null. But don’t confuse p-values with Type I error rates.
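As a toy illustration of the two ways of proceeding (not from the lecture; the observed statistic and alpha level here are made up for the example), here is a sketch in Python of a two-sided z-test. The same p-value can be reported as evidence, or fed into a pre-chosen decision rule:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

z_obs = 2.3                      # hypothetical observed test statistic
p = two_sided_p(z_obs)           # evidence view: just report this number

alpha = 0.05                     # decision-rule view: fix alpha in advance
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"p = {p:.4f} -> {decision}")
```

Note that in the decision-rule view only the reject/fail-to-reject outcome matters (alpha is the Type I error rate of the procedure), while in the evidence view the p-value itself is the summary.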

Regarding the interpretation of p-values, page 30 of the notes says:

Then amongst those experiments rejected with p-values in (0.05 − ε, 0.05) for small ε, at least 30% will actually turn out to be true, and the true proportion can be much higher (depends upon the distribution of the actual parameter for the experiments where the null is false)

This says that under these circumstances, the Type I error rate (probability of rejecting a true null), conditioned on our having observed p=0.05, is at least 30%!

I think one needs to think carefully about this claim. The Type I error rate is the probability of rejecting a null given that the null is true. The 30% referred to above is not a Type I error rate, because it is computed over a mixture of experiments that includes p-values generated under alternatives, so I don’t think it is fair to compare the 30% to a Type I error rate. They are not measuring the same quantity. Is it surprising that a large proportion of p-values in the indicated range would come from the null? I don’t think so, especially if the alternatives tend to be quite far from the null: under such alternatives, p-values are typically far smaller than 0.05, so the narrow band just below 0.05 is populated disproportionately by true nulls.
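The claim on page 30 can be checked by simulation. The sketch below is my own illustration, not from the notes: it assumes half the experiments have a true null, draws the alternative means from a N(0, 2²) distribution (an arbitrary choice), runs a two-sided z-test on each, and asks what fraction of the p-values landing in a narrow band just below 0.05 came from true nulls.

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

random.seed(1)
n_exp = 200_000
band = (0.04, 0.05)          # narrow band just below 0.05
null_count = alt_count = 0

for _ in range(n_exp):
    is_null = random.random() < 0.5          # assumed: half the nulls are true
    theta = 0.0 if is_null else random.gauss(0, 2)  # assumed alternative spread
    z = random.gauss(theta, 1)               # one observation, unit variance
    p = 2 * (1 - norm_cdf(abs(z)))           # two-sided p-value
    if band[0] < p < band[1]:
        if is_null:
            null_count += 1
        else:
            alt_count += 1

frac_null = null_count / (null_count + alt_count)
print(f"fraction of p-values in {band} from true nulls: {frac_null:.2f}")
```

With these particular assumptions the fraction comes out near 30%, consistent with the quoted claim; changing the prior proportion of true nulls or the spread of the alternatives changes the answer, which is exactly the parenthetical caveat in the quote.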

One Response to “Stat 295, April 16, 2009”

  1. bayesrules Says:

    I take Jeff’s point.

    Please note that Berger and Delampady have an extensive discussion of this point (and others) in their paper. See Section 4.5.
