Archive for March, 2011

HCOL 196, March 30, 2011

March 31, 2011

We have chosen FRIDAY, APRIL 15, as the date for the second (and last) quiz

You might be interested in this paper, which I wrote about the parapsychology experiment we have been discussing. The math in this paper does not go beyond what we have already seen in the course, so you should be able to understand it without difficulty.

I first drew a picture that shows why the null hypothesis is favored when the data (blue likelihood) are near the null (when compared to a complex alternative), and the alternative hypothesis is favored when the data (red likelihood) are far from the null. The product of prior times likelihood for each hypothesis is the amount by which each hypothesis is favored. Also, the figure shows that if you go from a flat prior to one that is peaked near the data, that will favor the alternative (complex) hypothesis.

We considered an alternative hypothesis in which the amount of bias is exactly what the data indicate. This is in the context of the parapsychology experiment, where the observed success rate is 0.500176, differing from 0.5 by less than 0.02 percentage points. Such a small bias, even if real, has no practical significance. You could not use it to beat the casino, for example, or to make lots of money on the stock market.

This prior on the alternative is the most favorable one to the alternative that you can devise; yet the p-value is five times smaller than the Bayes factor (which measures the evidence in favor of the alternative hypothesis). This is one reason why I believe that p-values are unreliable measures of the strength of evidence against the null.

(To do the calculation, we had to compute the log of the Bayes factor and then exponentiate it; multiplying the factors out directly is beyond the capabilities of a calculator.)
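The log-then-exponentiate trick can be sketched in a few lines. The trial count below (n = 104,490,000, with the success fraction 0.500176 quoted above) is an illustrative reconstruction, not a figure taken from this post; under a uniform prior on the bias, the marginal likelihood integrates to 1/(n+1), so the Bayes factor is (n+1)·C(n,k)·2^(−n):

```python
import math

def log_bayes_factor(n, k):
    """log of BF = P(D | fair) / P(D | uniform prior on bias).

    P(D | fair) = C(n,k) * (1/2)**n, and the marginal under a uniform
    prior on the bias integrates to 1/(n+1), so
    BF = (n+1) * C(n,k) * 2**(-n).  We work in logs via lgamma,
    since the individual factors under/overflow a calculator.
    """
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return math.log(n + 1) + log_choose - n * math.log(2)

n = 104_490_000              # illustrative trial count (~100 million events)
k = round(0.500176 * n)      # successes at the quoted rate
bf = math.exp(log_bayes_factor(n, k))
print(f"Bayes factor favoring the null: {bf:.1f}")  # roughly 12
```

With these numbers the Bayes factor comes out around 12 in favor of the null, even though the p-value for the same data is tiny.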

Jim Berger has an “objective” prior that has the following properties: Consider all priors that decrease (or do not increase) as you go away from the null. Calculate the posterior for each of these priors, and choose the one that most favors the alternative. The prior that results is flat up to the data, and then zero (and is symmetrical about the null). For this prior, the Bayes factor is more than 20 times the p-value. This again shows how unreliable a p-value is as a measure of the evidence against the null.

Finally, I argued that observed p-values aren’t actually probabilities at all. A p-value is calculated as the probability that, in some sequence of experiments that have not been performed, one would get results at least as extreme as the one actually observed. But if p-values were probabilities, the product of the p-values from two independent experiments would have to be the p-value for the combined experiment, since that is what the laws of probability demand for independent events. The calculation shows that it is not. For example, flip a fair coin 101 times and get 61 heads, then repeat the experiment and happen to get 61 heads again (these numbers are just for illustration; the problem arises no matter what the outcomes are): the product of the two p-values is 0.0021, but the combined p-value is much larger, 0.038!
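To make the coin-flip example concrete, here is a sketch that computes exact two-sided binomial p-values with nothing but the standard library. (Pooling the flips into one experiment of 202 tosses is just one way to define the combined experiment, so the pooled figure here need not match the number quoted above; the point is only that the product of the two p-values is not itself a p-value.)

```python
import math

def two_sided_p(k, n):
    """Exact two-sided binomial p-value under H0: p = 1/2.

    By symmetry, P(at least as extreme as k) = 2 * P(X >= k) for k > n/2.
    """
    upper = sum(math.comb(n, i) for i in range(k, n + 1))
    return min(1.0, 2 * upper / 2**n)

p_single = two_sided_p(61, 101)   # one experiment: 61 heads in 101 flips
p_product = p_single * p_single   # naive "combination" of two such results
p_pooled = two_sided_p(122, 202)  # p-value for the pooled data
print(p_single, p_product, p_pooled)
```

The product is smaller than the pooled p-value: multiplying p-values does not obey the product rule that genuine probabilities of independent events must satisfy.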


Relevant article

March 29, 2011

You will recall that we discussed cheating on multiple-choice tests some time ago.

Here is an interesting article on a different kind of possible cheating, discovered by statistical analysis of the paper forms used to test children for progress in the Washington, D. C. school district.

HCOL 196, March 28, 2011

March 28, 2011

Today I introduced the idea of Ockham’s razor (sometimes spelled ‘Occam’, which is the Latinized version). It dates to the 14th century and has been expressed in various ways by its inventor, William of Ockham. Two formulations that can be attributed to him are “Plurality must never be posited without necessity,” and “It is futile to do with more things that which can be done with fewer.” (These would have been written in Latin, of course.) These days, scientists interpret it to mean that we should use hypotheses that are just complex enough to explain the phenomena we want to explain, but no more complex.

For example, we determined that the hypothesis of a fair coin is simpler than the hypothesis of a biased coin, where the amount of the bias is unknown. So if the data we get from coin flips are explained pretty well by a fair coin, we would probably not opt for the more complicated biased-coin hypothesis. This is in fact what happened in the case of the parapsychology experiment we discussed last time. What happened there is that the hypothesis “fair, no psi” makes a bold prediction, whereas the “biased, psi” hypothesis spends a lot of prior probability on values of the bias that are far away from the observations. The net effect is that even though the p-value was quite small, the Bayes factor still supported the “fair, no psi” position.

It is like going into a casino and putting money on the roulette wheel. If you put all your money on one number, and it wins, you’ll make a bundle. But your winning is a low-probability event. On the other hand, you could almost guarantee winning something by putting a small amount of money on every number you can. You’ll almost certainly win, but your reward will be small because you had to divide your money among a lot of possibilities.

The roulette wheel and the parapsychology experiment are similar in this way: Putting all your money (or prior probability) on one number (or one simple hypothesis) results in a huge reward (in money or posterior probability) if the outcome is predicted by your bet; but if you spread your money (or prior probability) around over many possibilities, even though you’ll get some money (or posterior probability) with near certainty, it won’t beat the successful, bold player (or hypothesis) that bets all the money (or prior probability) on the simple outcome.

I spent most of the period discussing Einstein’s general theory of relativity. It is a theory of gravity that replaces Newtonian theory. It makes a number of predictions, and we will consider two of them. The first is that gravity bends the path of light, acting as a sort of lens. In the figure below, light from a distant star that passes near the Sun is deflected, so the star appears farther from the Sun (as indicated by the red line) than it actually is. The amount of bending, according to Einstein, for a ray that just grazes the surface of the Sun is 1.75 seconds of arc (written 1.75″). This is twice the amount predicted by Newton’s theory.

The British astronomer Eddington mounted an expedition in 1919 to try to detect this effect; they took a photograph of a field of stars at a time of year when the Sun was nowhere near that part of the sky, and another photograph during a solar eclipse when the Sun was in that field of stars (with the light of the Sun, which would otherwise drown out the stars, blocked by the Moon). Laying the two photographs on top of each other, we would see stars in the eclipse photograph moved away from the Sun relative to the non-eclipse photograph, with greater motion for the closer stars. Eddington reported that the motion was consistent with Einstein’s theory; we now know that his observations were not as accurate as he thought (this is technically a very difficult experiment). But today, using very accurate radio telescopes, we can make these observations so accurately that it is clear that they are inconsistent with Newton’s theory and consistent with Einstein’s.

The other famous observation concerns the motion (precession) of the perihelion (closest point in an orbit to the Sun) of Mercury. It had been known since 1859 that this motion was inconsistent with what was known at the time about planetary orbits. It wasn’t that Newtonian theory couldn’t explain the motion, but it needed something more than the planets that were known, or some modification of the law of gravity. The situation is shown in the figure below:

One could propose explanations that would solve the problem. There could be an unknown planet close to the Sun and hard to observe, for example. Other planets had been discovered recently, Uranus (by accident) and then Neptune (by its effect on the orbit of Uranus). Surely, some astronomers thought, we ought to be able to repeat the discovery of Neptune with a planet close to the Sun! And some astronomers claimed to have seen the elusive planet, and even gave it a name, “Vulcan,” after the Roman god of the forge. But these discoveries were never confirmed.

Other possibilities would be a faint ring of material near the Sun, or the Sun having a slight oblateness (elliptical cross-section). Yet another would be some subtle change in the law of gravity.

All of these explanations involve an adjustable parameter that can be chosen to match the observed precession. For example, the mass of Vulcan, the mass of the ring, the amount of the oblateness of the Sun, or the amount of the change in the law of gravity that would be required.

In the figure, we illustrate this. We can be sure that in the complex theories that involve an adjustable parameter, such as modifying the law of gravity, the adjustable parameter has to be such that the effect on Mercury’s orbit is not greater than 100″/century either way (positive or negative). The reason we know this is that if it were greater, we would see effects on other planets (Venus and Earth) that are not seen. So the prior probability is spread out under this theory over a wide range (blue graph).

On the other hand, Einstein’s theory puts the same amount of prior probability (area the same) into a tall, skinny rectangle (in red, and I’m sorry it’s hard to see) near 43″/century, which is what is predicted. [Note: The amount of this motion is predicted from the theory and has to do with the constant of gravity and the speed of light; it is not put into the theory from the known value of the motion.] Now, if the perihelion motion had been well away from 43″/century, as shown in the blue bell-shaped likelihood curve (representing the errors of the observations), then Einstein’s theory would be dead, because the product of prior × likelihood for the complex theory there is greater than the small amount of prior × likelihood that Einstein’s theory can get from the tail of the likelihood. But that’s not where the data lie. They lie almost smack dab on top of the tall, skinny rectangle, and now the product of prior × likelihood strongly favors Einstein’s theory.
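A toy version of the figure’s comparison, with made-up numbers: suppose the observation is 43.0″/century with a measurement error of 0.5″/century (both illustrative), the adjustable-parameter theories spread their prior uniformly over ±100″/century, and Einstein’s theory puts its prior mass at the predicted 43″/century:

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian likelihood for the observed precession."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

obs, sigma = 43.0, 0.5          # illustrative observation and error bar
# Complex theory: prior spread uniformly over [-100, 100] arcsec/century.
# Its marginal is about 1/200, since the likelihood sits well inside the
# range and integrates to 1.
marginal_complex = 1.0 / 200.0
# Einstein: all prior mass at the predicted 43 arcsec/century.
marginal_einstein = normal_pdf(obs, 43.0, sigma)
odds_for_einstein = marginal_einstein / marginal_complex
print(f"Ockham factor for Einstein: {odds_for_einstein:.0f}")
```

With these assumptions the simple theory is favored by a factor of more than a hundred; had the data fallen far from 43″/century, the Gaussian factor would collapse and the spread-out blue prior would win instead.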

Here’s a pointer to a paper that Jim Berger and I wrote about 20 years ago, explaining the Bayesian Ockham’s razor and the Einstein experiment. (The title when it was finally published is different from the running title. The editor of the journal didn’t like our title!)

Quiz #2

March 26, 2011

I asked everyone to think about dates for the second quiz. In particular, I need to know if you have a conflict that will mean that you CANNOT be present on a particular date. I am looking at April 11, 13 and 15. Please let me know if you have a problem with any of those dates.

We will discuss this and set the date on Monday, so it is especially important to let me know if (for example) you are feeling ill on Monday and cannot attend class. Be sure to e-mail me with problem dates if you cannot be present in class Monday.

HCOL 196, March 25, 2011

March 25, 2011

Today we discussed problem #2 of the last problem set. The calculations you did were mostly fine, but a lot can be learned by looking at this problem from an algebraic rather than a numerical point of view. So for the fair die, the numerical spreadsheet is above in the picture, and the algebraic one is below. Since there is only one value of p, the marginal P(D|F) is equal to the only joint entry, P(p,D|F).

For the biased die, the corresponding spreadsheet is as below:

For this problem, it is essential that the priors add up to 1; that is, you cannot use the shortcut of setting each prior probability to 1, knowing that you’ll divide out any constant when you compute the posterior. The reason is that for this problem we stop with the marginal distributions and never carry out that last division. That means that, to get the correct marginal P(D|B), the P(p_i|B) for i = 1, 2, …, 10 must add up to 1.

Once you have the two marginals, a third spreadsheet combines them to compute the posterior probabilities:

The formula for the posterior probability of “F” is shown below (you can get this directly from the spreadsheet above):

We also have the relationship between the posterior odds and the posterior probability (above). In the case where there are just two states of nature, you can directly compute the posterior odds as the product of the prior odds and the “Bayes factor,” which is the ratio of the two likelihoods. This is sometimes convenient.

I pointed out that P(D|B) is an approximation to an integral, and the approximation gets better and better, the more rows you have in your spreadsheet. The integral is shown below.
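That integral actually has a closed form when the prior on p is uniform: ∫₀¹ C(n,k) p^k (1−p)^(n−k) dp = 1/(n+1). A sketch (the n and k here are illustrative, not the problem’s numbers) showing the spreadsheet-style sum converging to it as the number of rows grows:

```python
import math

def marginal_grid(n, k, rows):
    """Spreadsheet-style approximation to P(D|B): a uniform prior
    P(p_i|B) = 1/rows on the midpoints p_i of `rows` equal cells,
    times the binomial likelihood, summed over the cells."""
    prior = 1.0 / rows               # the priors must add up to 1
    total = 0.0
    for i in range(rows):
        p = (i + 0.5) / rows         # midpoint of each cell
        total += prior * math.comb(n, k) * p**k * (1 - p)**(n - k)
    return total

n, k = 10, 7                         # illustrative: 7 successes in 10 trials
coarse = marginal_grid(n, k, 10)     # a 10-row spreadsheet
fine = marginal_grid(n, k, 1000)     # many more rows
exact = 1 / (n + 1)                  # the integral in closed form
print(coarse, fine, exact)
```

Even ten rows come close, and a thousand rows are essentially exact.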

I then turned to a parapsychology experiment that I once analyzed. Some researchers had published a paper in which they claimed that people could control a random event generator just by thinking about it. I found this very implausible. The experimental setup is shown in the figure:

The box has a red and a green lamp, and one or the other will light up at random, driven by electronics that measure random radioactive decays in a sample, with a 50% chance of red and a 50% chance of green. The subject (usually a student) tries to make the lamp come up one color (say red) more times than the other. A computer records the results automatically.

The experiment was run for many years with many subjects. Over 100 million random events were counted, with a slight excess in the intended direction. The p-value (two-sided) was very, very small, and would be considered highly statistically significant in the standard statistics world.

However, my Bayesian analysis is completely at odds with this result. The same data, analyzed by a method similar to the dice experiment we discussed in the problem, yield a Bayes factor of 12 in favor of the null hypothesis that there is nothing funny going on. The data make me more confident than before that the subjects cannot affect the random event generating device.

The British statistician Dennis Lindley pointed out many years ago that this sort of thing can happen in theory. That is, a significance test can strongly reject the null hypothesis at the same time that the same data would support the null hypothesis when looked at from a Bayesian point of view. From my point of view, since I trust the Bayesian methodology, this is reason to be very wary of some aspects of standard statistics, in particular p-values and hypothesis testing.

Finally, I mentioned that the prior probability still needs to be estimated. In a case like this, one way of assigning a prior probability is to imagine how much data you would require to change a skeptical point of view into one that was at least neutral towards the alternative hypothesis (that the student can indeed control the equipment). In my case, I would require that the student start the device with the intention to have 100 successive “red” flashes, for example, and actually obtain that outcome. That would be pretty impressive. The probability of that happening is 1/2^100 ≈ 1/10^30, which means that my prior odds in favor of the null hypothesis would be approximately 10^30, and my posterior odds about 12×10^30.

HCOL 196, March 23, 2011

March 24, 2011

I started by returning briefly to the question of the prior on guilt or innocence that we discussed last time, but from a different point of view. In particular, the reason why we cannot take the Grand Jury’s indictment as a statement that the probability of guilt is at least 50% is that you cannot use data twice in a legitimate Bayesian calculation. Thus, the jury would be hearing the same evidence that has already been used to indict, and a correct Bayesian calculation will not allow the data to be used again. Here’s why: The first use of evidence E in a calculation is shown in the following spreadsheet (for this guilt/innocence example, but it is generally true):

Similarly, one can look at the explicit calculation of the same thing via Bayes’ theorem, which can be gotten by looking at the spreadsheet and doing the calculation explicitly:

If we try to use evidence E a second time, with P(G|E) and P(I|E) as the prior, Bayes’ theorem requires us to condition the likelihood on E, since that’s what the conditional probability formula says. So Bayes’ theorem applied a second time with the same evidence E looks like the top of the following whiteboard snapshot:

But P(E|G,E)=1! The probability that we observed E, given that we observed E is 1 because if we observed E, then it is a sure thing that we observed E. Similarly, P(E|I,E)=1, so that the denominator simplifies to just P(G|E)+P(I|E), which is, according to our spreadsheet, exactly 1 as well. So Bayes’ theorem tells us that P(G|E,E)=P(G|E), and our attempt to use the data a second time hasn’t changed the posterior probability.

Another way to think about this (not mentioned in class) is that P(G|E,E) is the probability of guilt given that we observed evidence E and that we observed evidence E. But {“we have observed E” and “we have observed E”} is logically equivalent to {“we have observed E”}. Saying it twice doesn’t change anything. So that means that on the right hand of the conditioning bar |, {E,E}={E} and so P(G|E,E)=P(G|E).
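A tiny numerical check of the argument (the prior and likelihood values are invented for illustration): update once with evidence E, then attempt a second update in which the likelihoods, correctly conditioned on E, are both 1. The posterior doesn’t move.

```python
def bayes_update(prior_g, like_g, like_i):
    """Return P(G|E) given prior P(G) and likelihoods P(E|G), P(E|I)."""
    num = prior_g * like_g
    return num / (num + (1 - prior_g) * like_i)

# First (legitimate) use of evidence E -- illustrative numbers
post_g = bayes_update(prior_g=0.01, like_g=0.9, like_i=0.001)

# Second "use" of the same E: conditioned on E, P(E|G,E) = P(E|I,E) = 1
post_g_again = bayes_update(prior_g=post_g, like_g=1.0, like_i=1.0)

print(post_g, post_g_again)  # identical: E cannot be double-counted
```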

We then turned our attention to the O. J. Simpson case. You were all pretty young when this happened so I’ll remind you. O. J. Simpson was a famous football player who was married to Nicole Simpson, but they were separated. Their relationship was abusive; O. J. had frequently beaten Nicole, and the police had been brought in on numerous occasions.

O. J. had a lawyer, Alan Dershowitz, who is a professor of law at Harvard. On one occasion he made a remark to the press (a remark that was never mentioned at the trial). The remark was intended to make it appear less likely that O. J. was the murderer; but the actual effect was the opposite, as the statistician Jack Good showed. Dershowitz’s remark was that in any given year, only one in 2500 men who abuse their partners (batterers) goes on to murder their partner. This was intended to make such an event look very unlikely.

However, Jack Good learned that in any given year, only one in 20,000 women is murdered by a random stranger. This is even more unlikely. Now, O. J. had been arguing that the real murderer was a random stranger.

We can easily analyze this with a natural frequency tree, where we have three branches: One with women murdered by their partner, one with women murdered by a stranger, and one where the women are alive at the end of the year. Using a sample of 100,000 women, we find that 40 of the murdered women are murdered by their abusive partner, 5 are murdered by a stranger, and the rest are still alive.

Looking only at those who were murdered, we find that the odds are 8:1 that a murdered woman was murdered by her abusive partner. So this is evidence against O. J.’s innocence, not in favor of it. Using the formula we derived last time, probability = odds/(1+odds), we find the probability of guilt from this particular piece of evidence is approximately 0.9 (and there was lots of other evidence as well, although the jury found Simpson not guilty).
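The tree’s arithmetic, using the figures above, fits in a few lines:

```python
# Natural-frequency tree for 100,000 battered women in a given year
women = 100_000
murdered_by_partner = women // 2500    # 40: Dershowitz's 1-in-2500 figure
murdered_by_stranger = women // 20000  # 5: Good's 1-in-20,000 figure

# Condition on the fact that the woman was murdered
odds = murdered_by_partner / murdered_by_stranger  # 8:1 partner vs. stranger
probability = odds / (1 + odds)                    # odds -> probability
print(odds, probability)
```

The conversion in the last line gives 8/9 ≈ 0.89, the “approximately 0.9” quoted above.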

In the remaining 15 minutes I discussed how we can use evidence to determine whether a coin is fair or not, using Bayesian methods (Problem 2 in the problem set due today). This is one case where Bayesian and standard statistical methods are very different and give quite different answers. In standard statistics, we assume the truth of the “null hypothesis” of “fair” so that P(H)=P(T)=0.5 for a tossed coin, for example.

The probability of observing the exact data we observed, say 60 H and 40 T, is given by a binomial distribution, as on the whiteboard shot below, where C(100,60) is the “choose” function, the number of different ways that you can select 60 objects out of 100 distinct objects. (We’ve seen this before.) Before we settled on the binomial distribution, we had other suggestions, such as normal and chi-square; but these aren’t appropriate: for one thing, they both have infinite tails, and we know that when we toss a coin 100 times we get at least 0 heads and at most 100 heads, so distributions with infinite tails aren’t appropriate.

Here’s a smooth normal-like distribution. If we were testing a hypothesis using standard methods, we assume the null hypothesis (coin is fair) to be true, and then look at where the data lie. We then add up the probability in the tails, including not only the data we did observe (60 heads and 40 tails), but also all data even more extreme than that (including, perhaps the left-hand tail in this case of 40 heads and 60 tails, and all cases with less than 40 heads).

But a Bayesian answer isn’t going to depend on data that we haven’t observed, only on the data we have observed. So the Bayesian answer is going to depend only on the 60 heads and 40 tails data that we did observe, and not on the more extreme data that we did not observe. If you did Problem 2 correctly, you’ve already done this. We’ll discuss this more next time.
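To preview the contrast (a sketch, using a uniform prior on the bias for the “not fair” hypothesis, which is one simple choice): the tail-area p-value for 60 heads in 100 tosses sits near the conventional 0.05 line, while the Bayes factor of “fair” against “biased” is close to 1, i.e. the observed data alone carry almost no evidence either way.

```python
import math

n, k = 100, 60

# Standard-statistics answer: probability, under "fair", of data at
# least as extreme as 60 heads (both tails, by symmetry)
p_value = 2 * sum(math.comb(n, i) for i in range(k, n + 1)) / 2**n

# Bayesian answer: Bayes factor of "fair" against "biased with a
# uniform prior on the bias"; the marginal under that prior is 1/(n+1),
# and only the observed data enter, through C(n,k)
bayes_factor = (n + 1) * math.comb(n, k) / 2**n
print(p_value, bayes_factor)
```

Notice that the Bayes factor uses only the 60/40 outcome we actually observed; the more extreme, unobserved outcomes never enter.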

Interesting Item (Audio)

March 23, 2011

Freakonomics posted this podcast, which discussed how we often misperceive risks: Risks of getting eaten by a shark, killed in a terrorist attack, or dying in an airplane accident are much lower than many think, while the risk of brain injury in football may be quite high, and paradoxically may be exacerbated by the excellent helmet technology we now have. Worth a listen (takes about 1/2 hour, but you can put it into your iPod).

HCOL 196, March 21, 2011

March 22, 2011


Today’s discussion was about how to compute posterior probabilities of guilt. The states of nature are G, I, and the decisions are C, A. We need a prior on the states of nature. We cannot use the fact that the accused has been indicted, since that would already use evidence that will be used at trial, and it is wrong to use the same piece of evidence twice (in fact, it is impossible to do so if the rules of probability are used correctly; I’ll say something about this on Wednesday). One really should use a prior that expresses ignorance about innocence or guilt, and that prior would basically answer the question: what is the probability of guilt, given that the accused was just randomly picked up?

Thought about this way, if N is the number of people in the area, the probability that a randomly picked person is the guilty one is 1/N, and that should be the prior on guilt. If we work with odds (the ratio P(G)/P(I)), then the result, for the number N_CC of people in Chittenden County, is as on the top line of the whiteboard shot below. (The lower lines were things we put in later.)

To consider what would happen if we now have some evidence presented at trial, we imagine that evidence has been presented showing that the accused had a motive for the crime. The likelihood is shown below, where N_motive is the number of people in Chittenden County who have a motive (this may not appear in the trial; we may have to estimate it using our knowledge of things in general). A guilty person is certain to have a motive, but the probability that an innocent person has a motive depends on the number of people who have one.

When this is used to update the odds ratio, the results are shown in the second and third lines in the whiteboard shot above.
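A sketch of this update with hypothetical numbers (neither count appears in the post): prior odds of about 1/N from the random-person argument, times a likelihood ratio of P(motive|G)/P(motive|I) = 1/(N_motive/N). Notice that the posterior odds come out near 1/N_motive.

```python
# Hypothetical numbers, for illustration only
N = 150_000      # people in the county (illustrative)
N_motive = 10    # people with a motive (illustrative)

prior_odds = 1 / (N - 1)               # accused picked "at random"
likelihood_ratio = 1 / (N_motive / N)  # P(motive|G)=1, P(motive|I)=N_motive/N
posterior_odds = prior_odds * likelihood_ratio
print(posterior_odds)  # ~ 1/N_motive: still long odds against guilt
```

So motive evidence shrinks the pool from everyone in the county to everyone with a motive, but no further.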

We then talked about DNA (or fingerprint) matches. After some discussion, we found that the expert witness is going to give us a match probability, that is, the probability that there would be a match, given that the accused is just a random person. (Again, the probability of a match is 1 if the accused is guilty.) But this should not be confused with the probability of innocence, given a match. That requires us to use Bayesian reasoning. Confusing P(match|innocent) with P(innocent|match) is a mistake known as the Prosecutor’s Fallacy.

I pointed out that you can only add probabilities over mutually exclusive cases when the variable lies to the left of the conditioning bar |.

See the second and third blue lines below.

The bit below is just reminding us that we need to get probabilities of guilt or innocence given a match, not the other way around.

And the calculation from prior through likelihood to posterior, if the only evidence we look at is the match evidence, is shown below. Notice that our prior, which involves the number of people in Chittenden County, means that even the strong match evidence P(match|innocent) yields a posterior (with these example numbers) that isn’t enough to convict.
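A sketch of that calculation with made-up example numbers (both the county population and the match probability are illustrative, not the ones on the whiteboard): even a one-in-a-million match probability, combined with the random-person prior, leaves a posterior probability of guilt of only about 0.87.

```python
N = 150_000              # illustrative county population
p_match_innocent = 1e-6  # illustrative expert figure: P(match | innocent)

prior_odds = 1 / (N - 1)                              # "random person" prior
posterior_odds = prior_odds * (1 / p_match_innocent)  # P(match | G) = 1
p_guilt = posterior_odds / (1 + posterior_odds)       # odds -> probability
print(posterior_odds, p_guilt)
```

Impressive-sounding match evidence, filtered through a sensible prior, still falls well short of the near-certainty a juror should demand.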

You convert an odds ratio, letter O (terrible notation, it looks like a zero) into a probability using the following math:

I noted that on Wednesday we will discuss the O. J. Simpson case and I urged you to check that chapter in “Calculated Risks”.

HCOL 196, March 18, 2011

March 18, 2011

I started out mentioning a podcast that I had just listened to. It is about 1/2 hour long but interesting. In particular, I mentioned that the book “Predictably Irrational” is largely about how we make suboptimal decisions because of things that are not relevant to the decision.

We then discussed the death penalty problem. Here we have six outcomes, since there are two possible sentences, life without parole or death (in this example). The best outcomes are the correct ones: AI and CGL. Ranking them is difficult, some ranking them one way and some the other. For the sake of the example, since everyone thought they were pretty close, we assigned a loss of 0 to each. The next best outcome is AG, to which we assigned a loss of 1. It turns out that it doesn’t matter what the loss for CGD is, so I just assigned a loss of Z to it. We then assigned the remaining losses using test trees as we did last time.


To assign a loss for CIL (life but convicting an innocent person), the tree is similar to last time, except that since the penalty of life in prison is more severe than the 5-10 years of the last example, we are even more reluctant to make this mistake, so the probability of making the mistake is smaller. We agreed on p=0.001, which means a loss of 1000.

The worst outcome is CID, executing an innocent person. One student thought that p should be 0 in this case (I agree!), but it’s better just to  pick a non-zero probability that makes us comfortable. People proposed very small numbers, and we settled on p=0.0001, which gives a loss of 10,000,000.

(Click on image above to get full-size image in another window.)

Each juror would have to evaluate the losses for herself. But for these losses, the tree for making the decision is shown above. In this tree, p is now the posterior probability of guilt that each juror evaluates (we’ll discuss this next time). We see two things: (1) The death penalty can never be the lowest-expected-loss decision, since it is always worse than life without parole. This result is independent of your particular losses, as long as you think it is worse to execute an innocent person than to put him in prison for life. Just this one fact makes execution always much riskier than life in prison. The value of Z doesn’t affect this as long as Z ≥ 0. Even if, perversely, you took Z to be negative (that is, treating the execution of a guilty person as a better outcome than acquitting an innocent person, which seems strange), this result would not change unless you thought executing a guilty person was as much better than acquitting an innocent person as acquitting an innocent person is better than executing an innocent one!

Then we evaluated the posterior probability of guilt we would need in order to convict and (as we learned) send the accused to prison for life. It turns out to be about p = 0.999, which means you must be very sure of guilt.
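A sketch of the decision tree’s arithmetic, using the losses settled on in class (Z is set to 0 here for concreteness; as argued above, its value doesn’t matter so long as it is nonnegative):

```python
# Losses from the class discussion; Z (the loss for CGD) set to 0 here
L_AG, L_AI = 1, 0                 # acquit: guilty / innocent
L_CGL, L_CIL = 0, 1000            # convict + life: guilty / innocent
L_CGD, L_CID = 0, 10_000_000      # convict + death: guilty (Z) / innocent

def expected_losses(p):
    """Expected loss of each decision; p = posterior probability of guilt."""
    acquit = p * L_AG + (1 - p) * L_AI
    life   = p * L_CGL + (1 - p) * L_CIL
    death  = p * L_CGD + (1 - p) * L_CID
    return acquit, life, death

# For any p < 1, death loses to life: (1-p)*10,000,000 > (1-p)*1000.
# Acquit vs. convict-for-life break even when p*1 = (1-p)*1000:
threshold = 1000 / 1001
print(f"convict only if p > {threshold:.4f}")  # about 0.999
```

The break-even equation in the last comment is where the p ≈ 0.999 conviction threshold comes from.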

An opportunity

March 17, 2011

I just got email from Dan Ariely, who is the author of your book, “Predictably Irrational.” It describes a summer research opportunity for both undergraduates and graduate students. Here is the link to the web page.