I started by returning briefly to the question of the prior on guilt or innocence that we discussed last time, but from a different point of view. In particular, the reason why we cannot take the Grand Jury’s indictment as a statement that the probability of guilt is at least 50% is that you cannot use data twice in a legitimate Bayesian calculation. Thus, the jury would be hearing the same evidence that has already been used to indict, and a correct Bayesian calculation will not allow the data to be used again. Here’s why: The first use of evidence E in a calculation is shown in the following spreadsheet (for this guilt/innocence example, but it is generally true):

Similarly, one can look at the explicit calculation of the same thing via Bayes’ theorem, which can be gotten by looking at the spreadsheet and doing the calculation explicitly:
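The spreadsheet is not reproduced here. Written out, and assuming that guilt G and innocence I are the only two hypotheses (so that P(G)+P(I)=1), the first use of E would read roughly as follows:

```latex
\[
P(G \mid E) = \frac{P(E \mid G)\,P(G)}{P(E \mid G)\,P(G) + P(E \mid I)\,P(I)},
\qquad
P(I \mid E) = \frac{P(E \mid I)\,P(I)}{P(E \mid G)\,P(G) + P(E \mid I)\,P(I)}
\]
```

The two posteriors share the same denominator, so they sum to 1.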

If we try to use evidence E a second time, with P(G|E) and P(I|E) as the prior, Bayes’ theorem requires us to condition the likelihood on E, since that’s what the conditional probability formula says. So Bayes’ theorem applied a second time with the same evidence E looks like the top of the following whiteboard snapshot:

But P(E|G,E)=1! The probability that we observed E, given that we observed E is 1 because if we observed E, then it is a sure thing that we observed E. Similarly, P(E|I,E)=1, so that the denominator simplifies to just P(G|E)+P(I|E), which is, according to our spreadsheet, exactly 1 as well. So Bayes’ theorem tells us that P(G|E,E)=P(G|E), and our attempt to use the data a second time hasn’t changed the posterior probability.
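The steps of this argument can be written out in one chain. This is a reconstruction from the text, not a copy of the whiteboard:

```latex
\begin{align*}
P(G \mid E, E)
  &= \frac{P(E \mid G, E)\,P(G \mid E)}
          {P(E \mid G, E)\,P(G \mid E) + P(E \mid I, E)\,P(I \mid E)} \\
  &= \frac{1 \cdot P(G \mid E)}{1 \cdot P(G \mid E) + 1 \cdot P(I \mid E)} \\
  &= \frac{P(G \mid E)}{1} = P(G \mid E)
\end{align*}
```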

Another way to think about this (not mentioned in class) is that P(G|E,E) is the probability of guilt given that we observed evidence E and that we observed evidence E. But {“we have observed E” and “we have observed E”} is logically equivalent to {“we have observed E”}. Saying it twice doesn’t change anything. So that means that on the right hand of the conditioning bar |, {E,E}={E} and so P(G|E,E)=P(G|E).
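A small numerical sketch may make the point concrete. The prior and likelihood values below are hypothetical, chosen only for illustration; they are not the numbers from the class spreadsheet. The helper `bayes_update` is likewise just an illustrative name.

```python
def bayes_update(priors, likelihoods):
    """Return posteriors given priors and likelihoods, one entry per hypothesis."""
    joints = [p * l for p, l in zip(priors, likelihoods)]
    marginal = sum(joints)
    return [j / marginal for j in joints]

# Hypothetical numbers (assumptions, not from the class spreadsheet).
prior = [0.5, 0.5]        # P(G), P(I)
likelihood = [0.8, 0.2]   # P(E|G), P(E|I)

# First, legitimate use of E.
first = bayes_update(prior, likelihood)           # P(G|E), P(I|E)

# Correct second use of the same E: P(E|G,E) = P(E|I,E) = 1.
second = bayes_update(first, [1.0, 1.0])

# Illegitimate second use: pretending E is fresh evidence.
double_counted = bayes_update(first, likelihood)

print("after first use:   ", first)
print("correct reuse:     ", second)
print("double counting E: ", double_counted)
```

The correct reuse leaves the posterior where it was, while the double-counted version pushes the probability of guilt higher than the evidence warrants.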

We then turned our attention to the O. J. Simpson case. You were all pretty young when this happened so I’ll remind you. O. J. Simpson was a famous football player who had been married to Nicole Brown Simpson; by then they were divorced. Their relationship was abusive; O. J. had frequently beaten Nicole, and the police had been brought in on numerous occasions. In 1994 Nicole was murdered, and O. J. was charged with the crime.

O. J. had a lawyer, Alan Dershowitz, who is a professor of law at Harvard. On one occasion he made a remark to the press (a remark that was never mentioned at the trial). The remark was intended to make it appear less likely that O. J. was the murderer, but the actual effect was the opposite, as the statistician Jack Good showed. Dershowitz’s remark was that in any given year, only one in 2500 men who abuse their partners (batterers) goes on to murder his partner. This was intended to make such an event look very unlikely.

However, Jack Good learned that in any given year, only one in 20,000 women is murdered by a random stranger. This is even more unlikely. Now, O. J. had been arguing that the real murderer was a random stranger.

We can easily analyze this with a natural frequency tree, where we have three branches: one with women murdered by their partner, one with women murdered by a stranger, and one where the women are alive at the end of the year. Using a sample of 100,000 battered women, we find that in a given year 40 are murdered by their abusive partner (100,000/2500), 5 are murdered by a stranger (100,000/20,000), and the rest are still alive.

Looking only at those who were murdered, we find that the odds are 40:5, or 8:1, that a battered woman who is murdered was murdered by her abusive partner. So, this is evidence against O. J.’s innocence, not in favor of it. Using the formula we derived last time, that probability=odds/(1+odds), we find the probability of guilt from this particular piece of evidence is 8/9, or approximately 0.9 (and there was lots of other evidence as well, although the jury found Simpson not guilty).
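The natural frequency tree can be checked with a few lines of arithmetic. This is a minimal sketch using only the rates quoted above; the variable names are illustrative, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

# Figures quoted in the text.
women = 100_000                       # size of the hypothetical sample
rate_partner = Fraction(1, 2500)      # murdered by abusive partner, per year
rate_stranger = Fraction(1, 20_000)   # murdered by a random stranger, per year

# The three branches of the tree.
by_partner = women * rate_partner
by_stranger = women * rate_stranger
alive = women - by_partner - by_stranger

# Condition on "murdered": compare only the two murder branches.
odds = by_partner / by_stranger
prob = odds / (1 + odds)              # probability = odds / (1 + odds)

print("murdered by partner: ", by_partner)
print("murdered by stranger:", by_stranger)
print("alive:               ", alive)
print("odds partner:stranger =", odds, "-> probability =", float(prob))
```

The women who are still alive drop out of the calculation entirely once we condition on the fact that a murder occurred, which is why Dershowitz’s one-in-2500 figure is the wrong number to look at.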

In the remaining 15 minutes I discussed how we can use evidence to determine whether a coin is fair or not, using Bayesian methods (Problem 2 in the problem set due today). This is one case where Bayesian and standard statistical methods are very different and give quite different answers. In standard statistics, we assume the truth of the “null hypothesis” of “fair” so that P(H)=P(T)=0.5 for a tossed coin, for example.

The probability of observing the exact data we observed, say 60 H and 40 T, is given by a binomial distribution, as on the whiteboard shot below, where ₁₀₀C₆₀ is the “choose” function, the number of different ways that you can select 60 objects out of 100 distinct objects. (We’ve seen this before.) Before we got to the binomial distribution, we had other suggestions, such as normal and chi-square, but these aren’t appropriate. For one thing, they both have an infinite tail, and we know that when we toss a coin 100 times we get at least 0 heads and at most 100 heads, so infinite tails aren’t appropriate.
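The whiteboard shot is not reproduced here, but the binomial probability is easy to compute directly. A minimal sketch, assuming 100 independent tosses of a fair coin:

```python
from math import comb

n, heads = 100, 60
p_fair = 0.5

# Binomial probability: (ways to place 60 heads among 100 tosses)
# times P(H)^60 times P(T)^40.
prob_exact = comb(n, heads) * p_fair**heads * (1 - p_fair)**(n - heads)

print(f"P(60 H, 40 T | fair) = {prob_exact:.4f}")
```

Here `comb(n, heads)` is Python’s built-in version of the “choose” function. The result is about 0.011: even for a fair coin, any one exact outcome is fairly improbable.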

Here’s a smooth normal-like distribution. If we were testing a hypothesis using standard methods, we would assume the null hypothesis (coin is fair) to be true, and then look at where the data lie. We would then add up the probability in the tails, including not only the data we did observe (60 heads and 40 tails), but also all data even more extreme than that (including, perhaps, the left-hand tail, which in this case means 40 heads and 60 tails and all cases with fewer than 40 heads).
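The tail sum described above can be sketched as follows. This assumes a two-sided test that counts both tails, which is one common convention; a one-sided test would use the upper tail alone.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses with P(H) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 100

# Observed data and everything more extreme: 60 or more heads.
upper_tail = sum(binom_pmf(k, n, 0.5) for k in range(60, n + 1))

# The mirror-image tail: 40 or fewer heads.
lower_tail = sum(binom_pmf(k, n, 0.5) for k in range(0, 41))

two_sided = upper_tail + lower_tail

print(f"upper tail = {upper_tail:.4f}")
print(f"lower tail = {lower_tail:.4f}")
print(f"two-sided  = {two_sided:.4f}")
```

Almost all of the probability being summed belongs to outcomes that were never observed, such as 61 heads, 62 heads, or 39 heads.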

But a Bayesian answer isn’t going to depend on data that we haven’t observed, only on the data we have observed. So the Bayesian answer is going to depend only on the 60 heads and 40 tails data that we did observe, and not on the more extreme data that we did not observe. If you did Problem 2 correctly, you’ve already done this. We’ll discuss this more next time.
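For comparison, here is one way a Bayesian calculation could look. This is only a sketch under assumptions that are not from the problem set: two hypotheses with equal prior probability, “fair” (P(H)=0.5) and “biased”, where a biased coin’s P(H) gets a uniform prior on [0, 1] approximated by a grid. A different prior would give a different number.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses with P(H) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, heads = 100, 60

# Likelihood of the observed data under "fair".
like_fair = binom_pmf(heads, n, 0.5)

# Likelihood under "biased": average over a uniform prior on the bias,
# approximated on a grid of midpoints (an assumption of this sketch).
grid = [(i + 0.5) / 1000 for i in range(1000)]
like_biased = sum(binom_pmf(heads, n, p) for p in grid) / len(grid)

# Equal prior probability for the two hypotheses (also an assumption).
prior_fair = prior_biased = 0.5
post_fair = (like_fair * prior_fair) / (
    like_fair * prior_fair + like_biased * prior_biased
)

print(f"P(data | fair)   = {like_fair:.4f}")
print(f"P(data | biased) = {like_biased:.4f}")
print(f"P(fair | data)   = {post_fair:.3f}")
```

Only the probability of the observed 60 heads and 40 tails enters this calculation; no tail sums appear. Under these particular assumptions the posterior probability that the coin is fair comes out near one half, so the data do little to discredit fairness.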
