HCOL 195 10/30/09

November 1, 2009 by bayesrules

Today we looked at the homework. The second problem was similar to one you’ve already done, so we just looked at the first one. This is an example where Bayesian answers are very different from those obtained by frequentists. The idea here is that we have a precise hypothesis (the coin or die is fair) and an alternative one (that it is biased). In the first case, the probability of one outcome is specified precisely, but in the other, the probability of the outcome is unknown. Since it is unknown, the Bayesian thing to do is to regard the bias itself as a state of nature, and put a prior on it. Then we have a prior on the two hypotheses (fair, biased).

This is one case where we actually need to put a normalized prior on the value of the bias. Unlike the cases we have treated so far, there is no cancellation of factors in the final analysis. So in the biased case, we assumed biases of 0.05, 0.15, 0.25,…,0.95, and put a prior of 1/10 on each. (If we wanted to be more precise, we could put the prior on 0.005, 0.015, 0.025,…,0.995 and use a prior of 1/100 on each; that would require a spreadsheet calculation.) The likelihood under this case is P(data|p,biased)=p^h(1-p)^t, where h is the number of heads and t the number of tails (the data). The prior is P(p|biased)=1/10, and the joint probability is P(data|p,biased)P(p|biased)=P(data,p|biased). Summing over all values of p (getting the marginal) gives us P(data|biased). Here’s a snapshot of the whiteboard that results (not all numbers are filled in):

HW 1

Spreadsheet for biased (loaded) case

The calculation for the fair case is easier. In our example, if the die is fair, p=1/3 and (1-p)=2/3, so the likelihood is P(data|fair)=(1/3)^h(2/3)^t.

We adopted P(fair)=P(biased)=1/2. With these, we are now able to calculate the joint probabilities from the marginal for the biased case and the likelihood from the fair case:

P(data,fair)=P(data|fair)P(fair), P(data,biased)=P(data|biased)P(biased).

But also, P(data,fair)=P(fair|data)P(data), P(data,biased)=P(biased|data)P(data). Dividing these two we get just the ratio P(fair|data)/P(biased|data), which is the posterior odds ratio. Because we chose P(fair)=P(biased), this is also equal to B=P(data|fair)/P(data|biased), which is the Bayes factor. The probability of fair, given the data, is equal to B/(1+B). Here’s the whiteboard after this calculation:

Calculation of the posterior probability of fair
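The spreadsheet calculation above can be sketched in a few lines of Python. This is only an illustration: the function name and the data values h and t are made up for the example, and the fair probability of 1/3 is taken from the die example in the text.

```python
def bayes_factor_fair_vs_biased(h, t, p_fair, grid):
    """Return (B, P(fair|data)), assuming P(fair) = P(biased) = 1/2."""
    # Likelihood under the fair hypothesis: p_fair^h * (1 - p_fair)^t
    like_fair = p_fair**h * (1 - p_fair)**t
    # Marginal likelihood under "biased": sum of likelihood * prior over
    # the grid, with a normalized prior of 1/len(grid) on each bias value.
    prior = 1.0 / len(grid)
    like_biased = sum(p**h * (1 - p)**t * prior for p in grid)
    b = like_fair / like_biased        # Bayes factor
    return b, b / (1 + b)              # posterior probability of "fair"

# Biases 0.05, 0.15, ..., 0.95, as on the whiteboard
grid10 = [0.05 + 0.1 * i for i in range(10)]

# Hypothetical data (not from the homework): 4 successes, 8 failures
h, t = 4, 8
B, post_fair = bayes_factor_fair_vs_biased(h, t, 1 / 3, grid10)
print(f"Bayes factor B = {B:.2f}, P(fair|data) = {post_fair:.2f}")
```

Switching to the finer grid (0.005, 0.015, …, 0.995) only means passing a list of 100 values; the prior of 1/100 follows automatically from the length of the list.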

I then described a practical application of this theory. There was a project at Princeton University which was attempting to find evidence for paranormal powers. In one of the experiments, a student was placed in front of a device that randomly flashed red and green lights, and attempted, by pure thought, to “influence” the device so as to make the number of flashes of one of the colors greater than the number of flashes of the other color. The desired color was changed from time to time, so that on some runs, the student tried to make red flash more often, and on some others, green:

The experimental setup for a parapsychology experiment

I read a paper by these experimenters; they reported data on over 100 million trials that they had conducted over the years (with various students). In these trials, there was an excess of 18,471 flashes in the desired direction, less than 0.02% of the total. Even though this was a very small excess in absolute terms, the p-value, that is to say, the probability of getting an excess of 18,471 or more flashes, was also very small, about 0.0003 (I got the number wrong on the board, there should be one more zero!). This would be regarded as a highly significant rejection of the hypothesis that the device is fair.

Yet, the Bayesian calculation is very different! Doing the exact calculation that is approximated by the spreadsheet method we described above, I found that the Bayes factor was about B=12, which corresponds to a posterior probability in favor of the fair hypothesis of about 0.92! The Bayesian calculation supports the hypothesis that the device is fair, in contradiction to the significance test.

I wrote a paper on this subject and published it in the same journal where the original research was reported. This led to an exchange of letters to the editor.

Bayes factor for parapsychology experiment
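Here is a rough Python sketch of how the two numbers can be compared. Several things in it are assumptions, not figures from these notes: the exact trial count n (the notes say only "over 100 million"), the reading of "excess" as the number of hits above n/2, a uniform prior on the bias under the "biased" hypothesis, and a two-sided normal approximation for the p-value.

```python
import math

def log_binom_pmf(x, n, p):
    """Log of the binomial probability of x successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(p) + (n - x) * math.log(1 - p))

n = 104_490_000          # ASSUMED total number of trials
excess = 18_471          # excess reported in the text
x = n // 2 + excess      # ASSUMED: excess is counted relative to n/2

# With a uniform prior on p, the marginal P(data|biased) is exactly 1/(n+1),
# so B = P(data|fair) / P(data|biased) = pmf(x; n, 1/2) * (n + 1).
B = math.exp(log_binom_pmf(x, n, 0.5) + math.log(n + 1))
post_fair = B / (1 + B)  # equal prior odds on fair vs. biased

# Frequentist side: two-sided p-value from the normal approximation
z = excess / (math.sqrt(n) / 2)
p_value = math.erfc(z / math.sqrt(2))

print(f"B = {B:.1f}, P(fair|data) = {post_fair:.2f}, p-value = {p_value:.4f}")
```

The same data give a small p-value and a Bayes factor well above 1; changing the assumed n moves both numbers, so treat the output as illustrative only.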

How to explain the discrepancy? First, note that the p-value and the posterior probability are different things. The posterior probability is the probability of fair, given the data. But the p-value is the probability of the data, or any data even more extreme, given that the device is fair. We really want the Bayesian answer, but the frequentist calculation can’t give us that.

Bayesians regard the frequentist calculation as the right answer to the wrong question. It has a number of defects: First, it doesn’t say anything about any probabilities if the device is biased, yet it purports to tell us something about the coin being biased. Secondly, the probability calculated is based not only on the data that were observed (18,471), but also on all the possible data that were more extreme and which were not observed! Moreover, the more extreme data are not expected to be observed, just because they are more extreme. There seems to be something incoherent about basing a conclusion mostly on data that were not observed and were not even expected to be observed!

Dennis Lindley, a British statistician, pointed out that just this kind of outcome can happen: A statistical significance test (the p-value) can reject the “fair” hypothesis with a very small p-value, yet the Bayesian calculation can strongly favor that hypothesis.

HCOL 195 10/28/09

October 29, 2009 by bayesrules

Today we picked up where we left off last time. Here’s the board as we left it then:

Estimating Losses

No one thought that the value of p should be as big as 1/2; p=0.1 seems to be close to the median for the class. When you put this in, then the value of the loss for CI that makes the two branches have the same expected loss is 10.

Next, we put up a chart showing how a juror would decide a case. The juror would put the value of p that he or she has estimated from the data, and pick the decision that had the lowest expected loss:

Jury Decision Chart

But as the tree shows, this means that the strength of the evidence that would put an innocent person in jail isn’t really very great. Considering that the standard should be “beyond a reasonable doubt,” a probability of 0.9 for guilt seems too low (and the class unanimously thought so). So we bumped the loss for CI up to 100, which means you’d have to be 99% sure of guilt before you’d convict:

Revised Decision Chart
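The juror's rule can be written out as a small Python sketch. The function names are mine; the losses (0 for CG and AI, 1 for AG, and 10 or 100 for CI) are the ones from class.

```python
def conviction_threshold(loss_ci, loss_ag=1.0):
    """Probability of guilt at which convicting and acquitting tie.

    Expected loss of convicting:  (1 - g) * loss_ci   (defendant innocent)
    Expected loss of acquitting:  g * loss_ag         (defendant guilty)
    Setting them equal gives g = loss_ci / (loss_ci + loss_ag).
    """
    return loss_ci / (loss_ci + loss_ag)

def decide(p_guilty, loss_ci, loss_ag=1.0):
    """Pick the decision with the lower expected loss."""
    loss_convict = (1 - p_guilty) * loss_ci
    loss_acquit = p_guilty * loss_ag
    return "convict" if loss_convict < loss_acquit else "acquit"

for loss_ci in (10, 100):
    print(loss_ci, round(conviction_threshold(loss_ci), 3))

print(decide(0.95, loss_ci=10))    # 0.95 is above the threshold for loss 10
print(decide(0.95, loss_ci=100))   # but below the threshold for loss 100
```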

We then considered priors, and DNA evidence. For a prior, we considered the idea that in a geographic area, without any evidence (that is, picking someone at random), the prior for someone being guilty should be approximately 1/N where N is the population of the geographic area. So, for example, in Chittenden County we estimated the population at approximately 100,000 (it is actually about 50% higher than that). So the prior probability of guilt is about 1/100,000.

The question was raised, shouldn’t we use something like 1/2? After all, the person is on trial! They wouldn’t get there if they were almost sure of being innocent! The problem with this line of reasoning is that the person would have been indicted by a Grand Jury, which would have based its indictment on the very same evidence that the jury is supposed to consider in the trial. So, even if the evidence convinced the Grand Jury that a trial was warranted, e.g., that the probability of guilt was over 0.5, to use that number as a prior would in effect be using the same data twice, which is forbidden in Bayesian inference. You have to use a prior that is independent of any of the evidence that will come up in trial, one that depends only on general principles that are known outside of the details of the crime or the defendant. The population idea is one such; you might get another factor of two if the defendant were a man, since most crimes are committed by men. But that’s an insignificant factor. Also, that factor can just as well be built into the likelihood, which is probably a better place to do it.

We then considered hypothetical DNA data that has a probability of 1 in a million of matching a randomly chosen person (but a 1 in 1 chance of matching the perpetrator, of course). This is commonly thought to mean that there is a 1 in a million chance that the defendant is innocent, but this is incorrect. P(match|innocent) is not equal to P(innocent|match), and thinking that they are equal is known as the “prosecutor’s fallacy.” The actual calculation is shown in the chart below:

DNA Decision

The calculation gives a probability of guilt of about 0.9, which is insufficient to convict if the loss for CI is 100. More (independent) evidence would be needed.
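For anyone who wants to check the chart, here is a minimal Python sketch of the same calculation. The function name is mine; the numbers (a prior of 1/100,000, a match probability of 1 for the perpetrator and 1 in a million for anyone else) are the ones used in class.

```python
def posterior_guilt(prior_guilt, p_match_if_guilty, p_match_if_innocent):
    """Bayes' theorem for the two hypotheses guilty / innocent."""
    joint_guilty = prior_guilt * p_match_if_guilty
    joint_innocent = (1 - prior_guilt) * p_match_if_innocent
    return joint_guilty / (joint_guilty + joint_innocent)

p_guilty = posterior_guilt(1 / 100_000, 1.0, 1 / 1_000_000)
print(f"P(guilty|match) = {p_guilty:.3f}")

# With a loss of 100 for CI and 1 for AG, conviction needs P(guilty) > 100/101
print("enough to convict:", p_guilty > 100 / 101)
```

Note how far P(innocent|match), roughly 0.09 here, is from the 1-in-a-million figure for P(match|innocent).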

We then turned to the O.J. Simpson case. One of his lawyers had remarked to the press that in any given year, only 1 in 2500 batterers goes on to murder his partner. He meant this to show that it was unlikely that O.J. committed the crime, but it doesn’t take into account the fact that in a given year, only 1 in 20,000 women is killed by a random stranger:

Probabilities for OJ Simpson

When this information is entered into a Natural Frequencies chart, imagining a base population of 100,000 battered women, about 40 (that is, 100,000/2,500) will be killed by their batterer, but only 5 (that is, 100,000/20,000) would be killed by some random stranger (of the kind that O.J. himself claimed to be “seeking.”) So, the probability that the batterer does the deed is 40/45, or about 0.89. Thus, the evidence that Dershowitz brought forward actually supports the hypothesis that O.J. did the deed, rather than undermining it.

OJ Simpson Natural Frequencies Chart
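The natural-frequencies arithmetic is short enough to check in Python. The variable names are mine; the rates are the two quoted above.

```python
base = 100_000                        # imagined population of battered women

killed_by_batterer = base / 2_500     # 1 in 2,500 per year
killed_by_stranger = base / 20_000    # 1 in 20,000 per year

# Among the women who were killed, what fraction were killed by the batterer?
p_batterer = killed_by_batterer / (killed_by_batterer + killed_by_stranger)
print(killed_by_batterer, killed_by_stranger, round(p_batterer, 3))
```

The key step is conditioning on the fact that the woman was killed, which the 1-in-2,500 figure by itself ignores.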

HCOL 195 10/26/09

October 26, 2009 by bayesrules

Reminder: The project handout just had some ideas, you aren’t restricted to them and I am delighted when a group works on something completely new.

Note that there’s a difference between Bayes and frequentist ideas. In particular, in Bayesian thought it is perfectly legitimate to talk about the probability of something that just happens to be unknown to us, but is perfectly certain. For example, we can talk about the probability that the Nile is over 1000 miles long…think of it as a bet, for example, what odds would you be willing to give someone else to take either side of a bet that the Nile is over 1000 miles long? If you would be willing to bet at double or nothing, for example, then you think that it’s a 50% probability that the Nile is over 1000 miles long. Frequentists aren’t allowed to use probability this way.

In particular, you should not be thinking of the utilities and losses we’ve been discussing in terms of many, many bets. For example, if you own a house, you shouldn’t be thinking of a lot of identical situations where your house may or may not burn down in a given year. Either it does or it doesn’t. If the house is worth, say, $200,000, and there is a 1 in 1000 chance that it burns down in a year, then the fair value of the expectation of a bet with an insurance company (the premium) that the house will burn down is $200, but you would never get an insurance company to take that bet. They will require significantly more, to cover their fixed expenses and (over many different houses with many different customers) to have a high probability of making a profit for their shareholders.

When we discussed a sure $100,000 versus a 50:50 bet of $1,000,000 or nothing, many preferred the sure thing. This is because the additional $900,000 isn’t (for these folks) as valuable as the first $100,000.

I’ve already posted the link to the podcast on Elinor Ostrom (see previous post). The podcast says everything.

I noted that the patient is the one who has the responsibility to make decisions about medical care. This is because the patient is the one who suffers the consequences. The role of the doctor is to explain the treatments, the consequences, and how likely the various outcomes are, in a way that the patient can understand well enough to make informed decisions. Similarly, lawyers cannot tell their clients what they should do. They are like doctors: Explain the law, and the probable consequences if the client decides on various different courses of action.

There are no riskless actions. Even just lying in bed has risks. Crossing the street, you could get hit by a bus and killed. You are willing to do this for a meal worth a few dollars only because the risk of getting hit by a bus is very low. This can be used in principle to decide on how valuable (in dollars) you think your life is.

One student showed that it is better to buy two lottery tickets on different numbers than to buy two tickets on the same number.

We then discussed the problem of which is worse: Convicting an innocent person (CI) or acquitting a guilty one (AG).

If you acquit a guilty one, then that person will be free to reoffend; on the other hand, the fact that we have his fingerprints, DNA, picture, and other information about him might serve to deter him to some degree and will make him easier to catch.

If you convict an innocent one, then the real culprit is still at large, free to commit another crime. We don’t have any accurate information about the culprit (no DNA, no picture, no name, no prints), and the police will have stopped looking for him. So he may be more prone to commit other crimes. In addition, there is an innocent person in prison, which is another bad thing.

On balance, it seems that CI is worse than AG.

This leads us to consider the following decision tree (assuming that the good outcomes, CG and AI are equally good with a loss of 0). We adopt a loss of 1 for the intermediate decision, AG. We put the worst one, CI, at the top of the chance node. I asked you to think about what value of p would make you indifferent between the two decisions. We’ll discuss it next time.

Tree to decide on how bad CI is relative to AG

Podcast about Nobel Prize winner Elinor Ostrom

October 26, 2009 by bayesrules

As I mentioned in class today, there is an interesting discussion with Nobel Prize winner Elinor Ostrom on the NPR website. It may be listened to or downloaded here.

In addition, there are several interesting letters on the mammogram/prostate cancer discussion, which can be found here.

Finally, there’s the article about the M. D. Anderson Cancer Center, in Houston, that I mentioned. You can find it here. Dr. Don Berry, who is mentioned, is a Bayesian statistician. He heads their division of quantitative sciences.

HCOL 195 10/23/09

October 25, 2009 by bayesrules

Today we discussed the homework.

First the lottery problem. There were several things here that not everyone thought of. One important thing is that there were 200,000,000 tickets and a chance of 1/80,000,000 that a particular ticket would win. This means that in a series of such lotteries, we can expect 2.5 tickets to win on average, so you’d have to share your prize with 2.5 people, making the prize worth about $112,000,000 to you, not $280,000,000.

A refinement of this is to figure out the probability P(n) that there will be n=0, 1, 2,… other winners, and thus figure out the amount that you’d win in each of these cases, setting up a probability tree with many nodes instead of just putting $112 M. If p is the probability that a single ticket wins (p=1/80 M), then (1-p) is the probability that the ticket loses, and the probability that all N=200,000,000 tickets lose is (1-p)^N=0.0821. The probability that a specific ticket wins and all the others lose is p*(1-p)^(N-1)=p*0.0821, since there is hardly any difference between (1-p)^N and (1-p)^(N-1). But there are N tickets out there, so the probability that one of them wins and all the others lose is N*p*0.0821=0.2052 (I just realized I wrote the wrong number on the whiteboard). For two, the probability that a particular two win, one buying the ticket first and the other later, and all the others lose, is p^2*0.0821; but there are N of the first and (N-1) of the second, giving a factor of N*(N-1), which is essentially N^2, and there are two orders in which the tickets could have been bought, so this has to be multiplied by 1/2, giving a probability for two winners other than yourself of (N*p)^2*0.0821/2=0.2565. In general, for k other winners, the probability is (N*p)^k*0.0821/k!. The calculation of the probabilities of the various branches is outlined here:

Probabilities of the branches
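The branch probabilities can be reproduced with a short Python sketch. The function name is mine, and the expected-share line at the end is an extra illustration (before taxes and annuitization) that assumes the $280 M jackpot is split equally among you and the k other winners.

```python
import math

N = 200_000_000          # other tickets sold
p = 1 / 80_000_000       # chance that any one ticket wins
lam = N * p              # expected number of other winners (2.5)

p_none = (1 - p) ** N    # all N tickets lose, about 0.0821

def prob_other_winners(k):
    """Probability of exactly k other winners: (N*p)^k * (1-p)^N / k!"""
    return lam**k * p_none / math.factorial(k)

for k in range(5):
    print(k, round(prob_other_winners(k), 4))

# Expected share of the jackpot (in $ millions) if you hold a winning ticket
jackpot = 280.0
expected_share = sum(prob_other_winners(k) * jackpot / (k + 1)
                     for k in range(60))
print(round(expected_share, 1))
```

Under these assumptions the expected share comes out somewhat below the simple $112 M estimate, which is the point of setting up the tree with many branches.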

The first few of these are in the picture of the completed tree (without calculations):

Lottery Tree

But there are two other flies in the ointment. First is taxes: You don’t get to keep all the money, you have to give Uncle Sam 39% and some to Jim Douglas (if you live in Vermont). Second is annuitization: The only way you can get the full jackpot is to have the lottery buy an annuity for you that will pay you the amount over 20 years in equal installments. But if you take the money immediately (probably the best choice), they will only give you the amount that they have to pay the insurance company for the annuity, which is about half of the jackpot. So you will get about 0.5*0.6=0.3 of the amounts in the figure. So, on the simple average-share estimate, your net take after taxes would be, not $112 M, but only $33.6 M. That’s the figure that really should be entered as the gain, and when you do this (and similarly for the other numbers), the “buy the ticket” branch will actually have a loss.

What does the lottery do with the $140 M that it doesn’t have to pay out? It uses it to finance its beneficiary, education mostly. So net, the lottery is actually a tax that people are willing to pay.

Then we turned to the lawsuit problem. Most groups did a pretty good job; there were some calculational glitches but they were minor. One group added a third branch, “just continue the lawsuit,” which wasn’t among the choices, but the tree says that this isn’t the best choice (I’ve added this below). The only unusual item in this tree is the second box that comes if the other side makes a counter-counter offer of $3 B. Since this choice comes later on in the logic, it is to the right of the first decision box. The final tree is here:

Lawsuit tree

HCOL 195 10/21/09

October 22, 2009 by bayesrules

This will be short. We filled out the “attitude towards risk” form and plotted our risk profile. We found three typical forms, namely:

Risk Averse Profile

The first, risk-averse profile is a very common profile; it says that a person, when considering a gain, is willing to accept less than the fair or expected value of a risky proposition in order to lock in a sure thing gain. So, for example, one might be willing to accept $4,000 as a sure thing rather than a 50-50 chance on a gain of $10,000. Similarly, one might be willing to accept a sure loss rather than run the risk of a much larger loss that only happens with some probability. Most people have a risk profile like this.
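To make the idea concrete, here is a small Python sketch. The square-root utility function is purely an assumption for illustration (nobody's actual risk profile from class); any concave curve would show the same effect.

```python
import math

def utility(x):
    """ASSUMED concave (risk-averse) utility for gains; illustration only."""
    return math.sqrt(x)

def expected_utility(gamble):
    """gamble is a list of (probability, payoff) pairs."""
    return sum(prob * utility(payoff) for prob, payoff in gamble)

gamble = [(0.5, 10_000), (0.5, 0)]   # 50-50 chance at $10,000 or nothing
expected_value = sum(prob * payoff for prob, payoff in gamble)

# Certainty equivalent: the sure amount with the same utility as the gamble
certainty_equivalent = expected_utility(gamble) ** 2

print(expected_value, certainty_equivalent)
print("takes a sure $4,000:", utility(4_000) > expected_utility(gamble))
```

With this assumed curve the certainty equivalent is well below the expected value of the gamble, so a sure $4,000 is preferred.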

Risk Neutral Profile

This profile is risk neutral (a straight line). It represents the risk profile of a large company, like an insurance company, that has many bets out, some of which it will win and some of which it will lose, but which can be predicted statistically with high accuracy by the company’s actuaries. The difference between this kind of risk profile for an insurance company and the risk-averse profile of a typical insurance buyer (the first plot) explains how it is that people will willingly buy insurance, paying a fixed premium to an insurance company, to (for example) avoid financial disaster if their house burns down, and at the same time the insurance company is willing to take on this risk, since they can charge each policy holder a premium that will, on average, more than cover the expected losses in aggregate. Because of this difference, the insurance company can expect a profit with a high degree of certainty, a profit that will be distributed to the shareholders as a dividend or (in the case of a mutual insurance company, which is owned by the people who have policies) a reduction of premiums.

Risk Seeking Profile

This last profile is anomalous: It is risk averse for gains, but risk seeking for losses. It is sometimes seen in practice at casinos, where someone who is “down” may take extraordinary risks to try and get even. This is not usually a good idea.

Finally, we had a visit from Brit Chace, the HC Student Fellowship Advisor, on fellowship and scholarship opportunities (like the Rhodes and Marshall Scholarships, which allow students to study in England, and the Goldwater Scholarships). Her office is right across the hall from our classroom, and she encourages everyone to visit with her and discuss these opportunities.

New York Times article today

October 21, 2009 by bayesrules

The Times had an article today that discusses the consequences of false positives in the context of mammography and the PSA test for prostate cancer. Worth reading!

And here’s another article that came out on Thursday morning.

HCOL 195 10/19/09

October 20, 2009 by bayesrules

We discussed the test. No problems with #1. On #2 the most effective way to do it is to list the possibilities and count up those that have the first child a boy (hence the king) and then count the individual cases: two brothers, two sisters, one brother and one sister. The four cases that are relevant are

BBB
BBG
BGB
BGG

Note that BBG and BGB are not the same. Therefore there is a 1/4 probability of two brothers, the same for two sisters, and a 1/2 probability of one brother and one sister. These add up to 1.
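The counting can be done by brute force in Python. This sketch assumes, as we did in class, a family of three children with boys and girls equally likely, and keeps only the families whose first child is a boy.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All three-child families, keeping those where the first child is a boy
families = ["".join(f) for f in product("BG", repeat=3) if f[0] == "B"]
print(families)

# Count the number of brothers among the other two children
brothers = Counter(f[1:].count("B") for f in families)
probs = {k: Fraction(v, len(families)) for k, v in brothers.items()}

print("two brothers:", probs[2])
print("one brother, one sister:", probs[1])
print("two sisters:", probs[0])
```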

In problem #3 there are 10 states of nature (SON): 1, 2, …, 10. The likelihood for each SON is (SON/SON)*((SON-1)/SON)*(2/SON)*(2/SON). The denominator is always the SON since that’s how many fish there are in the lake each time we catch one. The first two numerators represent the number of untagged fish left in the lake, and the second two the number of tagged fish in the lake, for the four fish we caught. One student started with the smallest SON=5, but that’s not what the statement of the problem says. Also, tagged vs. untagged are not states of nature, they are data.
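As a sketch, the likelihood formula above can be turned into a posterior in Python. The equal prior of 1/10 on each state of nature is an assumption made here for illustration; use whatever prior the problem statement specifies.

```python
from fractions import Fraction

def likelihood(son):
    """Likelihood of the four catches if the lake holds `son` fish."""
    return (Fraction(son, son) * Fraction(son - 1, son)
            * Fraction(2, son) * Fraction(2, son))

states = range(1, 11)
prior = Fraction(1, 10)   # ASSUMED equal prior on each state of nature

joint = {son: likelihood(son) * prior for son in states}
marginal = sum(joint.values())
posterior = {son: joint[son] / marginal for son in states}

for son in states:
    print(son, float(likelihood(son)), round(float(posterior[son]), 4))
```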

Problem #4 is easiest done by using natural frequencies: If we have 2000 patients (may as well use that number as it is directly useful for the last question), then 1%, or 20 of patients will have the disease and 1980 will not. Of the 20 that have the disease, 19, or 95%, will test positive. The remaining one will test negative. Of the 1980 patients who don’t have the disease, 4%, or 79 will test positive (it’s really 79.2, but we can round here without sensible error). That’s the answer to the number of false positives in the group of 2000 patients. The probability of having the disease, given that you test positive, is 19/(19+79)=19/98, or a little over 0.19.
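The same natural-frequencies bookkeeping in Python, as a quick check (variable names are mine; the rates are the ones in the problem):

```python
patients = 2000
sick = patients * 0.01            # 1% have the disease: 20
healthy = patients - sick         # 1980

true_positives = sick * 0.95      # 95% of the sick test positive: 19
false_positives = healthy * 0.04  # 4% of the healthy test positive: 79.2

p_disease_given_positive = true_positives / (true_positives + false_positives)
print(round(false_positives, 1), round(p_disease_given_positive, 3))
```

Without rounding 79.2 down to 79 the answer changes only in the fourth decimal place.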

The fifth problem has a table of independence. The marginals are .25 and .75 in the horizontal direction and .5, .2 and .3 in the vertical direction. Each entry in the joint table is the product of the corresponding marginals, which proves the result. To make it dependent, you can add a fixed number to two entries and subtract the same number from two others; this would involve four numbers changed in the table.
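Here is a small Python sketch of the independence check. Which set of marginals goes with rows and which with columns is my assumption, as are the size of the adjustment (0.02) and the choice of which four entries to change so that the marginals stay the same.

```python
marg_a = [0.25, 0.75]        # one set of marginals
marg_b = [0.5, 0.2, 0.3]     # the other set

# Independent joint table: each entry is the product of its marginals
joint = [[a * b for b in marg_b] for a in marg_a]

def marginals(table):
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    return rows, cols

def is_independent(table, tol=1e-12):
    rows, cols = marginals(table)
    return all(abs(table[i][j] - rows[i] * cols[j]) < tol
               for i in range(len(rows)) for j in range(len(cols)))

# Change four entries in a 2x2 block: the marginals are unchanged,
# but the entries are no longer products of the marginals.
d = 0.02
dependent = [row[:] for row in joint]
dependent[0][0] += d
dependent[1][1] += d
dependent[0][1] -= d
dependent[1][0] -= d

print(is_independent(joint), is_independent(dependent))
```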

For the last problem, pick to use either gains or losses and stick to it. Losses is easiest; then there is a loss of $800×10 million, or $8 billion, if we require installation; if we do not require it, then there will be a loss of 10,000×$5 million, or $50 billion, due to lives lost that might have been saved. The second loss is greater, so we should reject that branch and require installation of the safety device. Note that you use each number exactly once: Some students tried to use the numbers on both branches, once as a gain and once as a loss, but that doesn’t work.
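The two branches, written out in Python with losses only (the dictionary keys are just labels I chose):

```python
# Each number from the problem is used exactly once, always as a loss
losses = {
    "require installation": 800 * 10_000_000,   # $800 x 10 million
    "do not require": 10_000 * 5_000_000,       # 10,000 x $5 million
}

for branch, loss in losses.items():
    print(f"{branch}: ${loss / 1e9:.0f} billion")

best = min(losses, key=losses.get)   # branch with the smaller loss
print("choose:", best)
```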

I asked whether people would prefer $100,000 as a sure thing or a 50% chance at $1 million and a 50% chance of nothing. About half the class preferred the sure thing, and half the gamble. We then said, what if the probability of getting the $1 million were 0.1, 0.2,…,0.9, 1.0. As the probability ramped up, more people were willing to take the gamble, but two students would only go for the $1 million if it were a “sure thing”, that is the probability were 1.

Then we talked about being on a jury. We decided that the four possibilities are: AI (acquit someone who is innocent), CI (convict someone who is innocent), AG (acquit someone who is guilty) and CG (convict someone who is guilty). We discussed which of these were the best and the worst outcomes. While it is clear that making a right decision (AI or CG) is good, and making a wrong decision (CI or AG) is bad, we didn’t come to agreement as to how to order the two good ones and the two bad ones. We’ll bring this up again later.

HCOL 195 091016

October 17, 2009 by bayesrules

Today we just talked about my experiences with the Hubble Telescope project, and in particular how the bad mirror happened and what was done about it.

Monday, I’ll return the graded tests and discuss the results.

Friday

October 14, 2009 by bayesrules

Some important points:

There is NO journal due on Friday this week. Next journal is due on Friday, October 23.

I intend to talk about something fun on Friday, namely, how we use Bayesian inference to solve problems in astronomy. I’ll describe some work that I have been involved with.