Archive for the ‘HCOL 196’ Category

HCOL 196, May 4, 2011

May 4, 2011

We had a short class today. I mentioned that the student who said that the last smallpox victim was in Somalia was correct (it was a less virulent form of the disease, known as Variola minor); I was also correct in noting a victim in India, the last of the more virulent form, Variola major. But these were all victims who caught the disease “in the wild.” The actual last victim was a medical photographer in the UK who was exposed by accident. Unfortunately, she died. Here is the Wikipedia entry.

I mentioned also an article about using linguistic evidence to learn about the origins of the Japanese people. They constructed a “phylogenetic tree” of the language from the various dialects in existence (and some other information), similarly to the way biologists construct phylogenetic trees of organisms from DNA evidence. We’d heard about this before. The people who did this study used Bayesian methods.

I also mentioned the doctoral dissertation defense of a student I had in my graduate Bayes course some years ago, who got interested in Bayesian statistics and has used it to construct phylogenetic trees for organisms. It was a very difficult project, and he did a very nice job. Had I known that his talk would be so well presented, I would have posted the information for you; I am sure most of you could have followed it.

I then left, and you all filled out the course evaluation forms.

I’ve enjoyed teaching this class this semester. You have all been very interested and have worked hard. I hope you all have a very pleasant summer (I know that some of you are travelling abroad, and to you I say “Bon voyage!”, or if you are going to China, “一路平安!”)

HCOL 196, May 2, 2011

May 3, 2011

Short entry today. We discussed my comments on the written part of the projects.

We also talked about the big news of the day, the death of Osama Bin Laden.

Please attend on Wednesday. The last 15 minutes of class will be the Course-Instructor Survey. Your feedback is very much appreciated!

HCOL 196, April 29, 2011

April 29, 2011

Again, I thank all of you for your work on your presentations, and today’s presentations on Prosecutor’s Fallacy and Human Medical Experimentation, both of which were thought-provoking.

Monday I plan to discuss the presentations, and I will give each of you your current grade (assuming that you don’t do anything more).

Meantime, check out this cartoon. I think you will be amused.

HCOL 196, April 27, 2011

April 28, 2011

Thanks to our two teams who presented their findings about Pascal’s Wager and the decisions about legalization of drugs.

See you all on Friday. Please be prompt. The amount of time is limited.

HCOL 196, April 25, 2011

April 26, 2011

Reminder: On Wednesday and Friday, when you do your presentations, it is very important to be prompt to class as we have a limited amount of time.

On Monday I talked about another astronomical problem that we have solved using Bayesian methods. Like the problem we discussed last time, this one involves a special kind of variable star, this time called RR Lyrae stars (again, named for the first one of its type discovered). These are fainter than the Cepheids, but they are known to have the same luminosity. So again, if we can calibrate that luminosity, we can use these stars to measure the distances of objects that contain them. And again, the nearest of these stars is a little too far away to measure its distance directly; we were only able to do that with the Hubble. But we’d like an independent check.

This time we can’t use the method that we used for the Cepheids, but we know that these stars have low velocities relative to the galaxy (and thus high velocities relative to the Sun, which is moving rapidly around the galaxy). To a first approximation, consider four such stars, located at 12:00, 3:00, 6:00 and 9:00 on a clock face relative to the Sun (see picture). The stars at 12:00 and 6:00 can have their velocities towards or away from the Sun measured by Doppler shift, just as we measured the velocity of the pulsations of the Cepheid variable. These velocities don’t depend on how far away the star is. The stars at 3:00 and 9:00 aren’t moving towards or away from the Sun; they appear to fly by us, as trees at our sides would if we were driving down the highway. They have an angular velocity, and that angular velocity depends on their distances.

We can measure these angular velocities by observing a star at two times, noting the change in angle, and dividing by the time interval. Then, just as in the Cepheid case, we have an angular velocity and a linear velocity (on two sets of stars), and this allows us to calculate the distance of the stars at 3:00 and 9:00, and thus calibrate the brightness of these stars (and by extension, all RR Lyrae stars).

It’s a bit more complicated than this of course. In the first place, we aren’t going to find stars at exactly those points, so some geometric modeling is needed. Also, the stars do not have exactly zero velocities relative to the galaxy, they do have their own motions, so we’re going to have to model that statistically as well. This is easily done using the Bayesian method.
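To make the geometry concrete, here is a minimal Python sketch of the basic step, with wholly made-up numbers: take the linear speed measured by Doppler shift on the 12:00/6:00 stars as typical of the 3:00/9:00 stars, then divide by the measured angular velocity to get a distance.

```python
def distance(linear_velocity_m_s, angular_velocity_rad_s):
    """Small-angle relation d = v / omega for a star moving
    across the line of sight (consistent units assumed)."""
    return linear_velocity_m_s / angular_velocity_rad_s

# Illustrative numbers only, not real RR Lyrae measurements:
# 200 km/s across the sky, angular velocity 2e-13 rad/s.
d = distance(2.0e5, 2.0e-13)   # 1e18 m, on the order of 30 parsecs
```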

I noted that in general we cannot use the “spreadsheet” methods of the class to solve problems like this. Although the basic ideas remain the same, with more than two parameters of interest the spreadsheet approach rapidly becomes impractical. For the past quarter century, statisticians have instead used another technique, which involves various strategies for drawing a large sample from the posterior distribution and then making inferences from that sample. But we start at the same basic place. With x representing the data and \theta the states of nature, we start with a prior on \theta and a likelihood, and multiply them to get the joint. The problem is that to compute a posterior distribution we need to divide the joint by the marginal, and in general the marginal is computed by integration (unless the parameters are discrete, so that we can sum). But integration is hard. So statisticians have learned how to avoid that step altogether by sampling from the unnormalized joint directly.

The basic technique is called Markov chain Monte Carlo (I didn’t name it in class). Markov chains are a particular type of mathematical object, and Monte Carlo refers to the famous casino on the Mediterranean coast. A simple example is shown in the diagram, where we are trying to draw a sample from a normal (bell-shaped) distribution. We start anywhere, with some value \theta_0. Starting there, we generate a new \theta^* from a uniform distribution between \theta_0-d and \theta_0+d. We then compute the ratio of the joint distribution at the new point to that at the old point. We generate a uniform random number between 0 and 1, and compare it to the ratio. If it is less than the ratio, we jump to the new value, so \theta_1=\theta^*, otherwise we set \theta_1=\theta_0. We then repeat the procedure as long as we wish, obtaining a large sample.

If you think about it, you can easily see that if the distribution at the new point is bigger, you will definitely move there, but if it is less, you will sometimes move there and sometimes not. This means that the points will concentrate near the peak of the curve, and in fact the sample will be a very good approximation to a sample from a normal distribution.
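The procedure above can be sketched in a few lines of Python. The target here is an unnormalized normal density, and the step size d and the number of samples are arbitrary choices for illustration, not anything from class.

```python
import math
import random

def metropolis_normal(n_samples, d=1.0, theta0=0.0, seed=1):
    """Metropolis sampler for a standard normal target.

    Proposals are uniform on [theta - d, theta + d]; a move is
    accepted with probability min(1, ratio of target densities).
    """
    random.seed(seed)
    target = lambda t: math.exp(-0.5 * t * t)  # unnormalized N(0,1) density
    theta = theta0
    samples = []
    for _ in range(n_samples):
        proposal = random.uniform(theta - d, theta + d)
        ratio = target(proposal) / target(theta)
        if random.random() < ratio:   # always accept if ratio >= 1
            theta = proposal
        samples.append(theta)
    return samples

samples = metropolis_normal(50_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With enough samples, the sample mean and variance come out close to 0 and 1, as a normal distribution requires.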

This is a very simple example. There is an art to designing programs that will sample efficiently, but it’s not hard to design programs that will sample effectively, given enough time.

There are many variations on these schemes. Usually it is easiest to sample on a small number of parameters at a time. You can, in the following figure, sample first on \theta_1 and then on \theta_2 and then repeat.
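The one-coordinate-at-a-time variant can be sketched as follows, again on a toy target (two independent standard normals); the step size and chain length are arbitrary illustrative choices.

```python
import math
import random

def componentwise_sample(n_samples, d=1.0, seed=2):
    """Update one parameter at a time: a Metropolis step on theta_1
    with theta_2 held fixed, then on theta_2 with theta_1 fixed."""
    random.seed(seed)
    target = lambda t1, t2: math.exp(-0.5 * (t1 * t1 + t2 * t2))
    t1, t2 = 0.0, 0.0
    chain = []
    for _ in range(n_samples):
        # Step 1: propose a new theta_1, theta_2 held fixed.
        p1 = random.uniform(t1 - d, t1 + d)
        if random.random() < target(p1, t2) / target(t1, t2):
            t1 = p1
        # Step 2: propose a new theta_2, theta_1 held fixed.
        p2 = random.uniform(t2 - d, t2 + d)
        if random.random() < target(t1, p2) / target(t1, t2):
            t2 = p2
        chain.append((t1, t2))
    return chain

chain = componentwise_sample(20_000)
m1 = sum(t1 for t1, _ in chain) / len(chain)
m2 = sum(t2 for _, t2 in chain) / len(chain)
```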

The method was first developed at Los Alamos National Laboratory in the 1950s, and was used in the development of nuclear weapons.

HCOL 196, April 22, 2011

April 24, 2011

We postponed the presentations until Wednesday the 27th and Friday the 29th, so that we would not interfere with people’s holiday plans. Each team knows when they are scheduled to present. It is important for people to arrive on time for presentations, because the time available is limited. Please, everyone, make every effort to be on time.

On Friday I talked about work that I’ve done using Bayesian methods. I talked about a particular kind of star, Cepheid variables, which are very important in astronomy. These stars are fairly bright, and they pulsate regularly.

In 1908, Henrietta Leavitt discovered that the intrinsic luminosity of these stars depends on their pulsation periods: The longer the period, the more luminous the star is. This is called the period-luminosity relationship. Of course, as the star gets farther away, it will become fainter to us, but the intrinsic luminosity doesn’t depend on the distance of the star. This gives us a way to measure the distance to one of these stars, if we know its intrinsic luminosity: Measure its apparent luminosity, which decreases as the square of the distance to the star (the “inverse square law”), and calculate the distance from the intrinsic and apparent luminosities.
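The inverse-square-law step can be illustrated in a couple of lines; the units here are made up and consistent only with each other, not real stellar quantities.

```python
import math

def distance_from_luminosity(intrinsic_luminosity, apparent_flux):
    """Invert the inverse-square law F = L / (4 * pi * d**2) for d."""
    return math.sqrt(intrinsic_luminosity / (4 * math.pi * apparent_flux))

# A star 10 times farther away appears 100 times fainter:
d1 = distance_from_luminosity(1.0, 1.0 / (4 * math.pi))    # d = 1
d2 = distance_from_luminosity(1.0, 0.01 / (4 * math.pi))   # d = 10
```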

To do this, we need to calibrate the period-luminosity relationship. That is, even though we know from Leavitt’s work that there is a relationship, we do not have the tools to convert a period into an intrinsic luminosity. Leavitt was looking at these stars in objects that had a number of them (the nearby galaxies known as the Magellanic clouds), so that all the stars were at about the same distance from us. Since they were all at the same distance, the period-luminosity relationship showed up as a relationship between the periods and the apparent luminosities of these stars. But no one knew the distance to the Magellanic clouds, so this was missing a crucial piece of information, that is, the distance to at least one Cepheid variable.

The direct method of measuring the distance to a star is to observe it as the Earth goes around the Sun. Because of the Earth’s motion, the direction to the star changes slightly. If we can measure this very small change in angle, we can determine the distance to the star by trigonometry, as shown in the figure:

The trouble is that the angles are very small, less than 1 second of arc even for the nearest star (Proxima Centauri). When we started our work on the Hubble, the errors of parallaxes were about ±0.01 second of arc, which means that we could not measure distances to stars farther than about 100 parsecs, or 326 light years, away. Unfortunately, the nearest Cepheid is almost three times farther away than that, so another method was needed. (Since then, we have measured the distance to this star to quite high precision, but we’d like to have other, independent measures.) And, as you can see from the picture of our galaxy, most stars are just too far away.
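The parallax arithmetic itself is one line; the point of the ±0.01 arcsecond error figure is that it is as large as the entire parallax of a 100-parsec star.

```python
def parallax_distance_pc(parallax_arcsec):
    """Distance in parsecs from annual parallax in arcseconds."""
    return 1.0 / parallax_arcsec

# A 0.01-arcsec parallax corresponds to 100 parsecs; with ±0.01
# arcsec errors, such a measurement is at the limit of usefulness.
d = parallax_distance_pc(0.01)   # 100 parsecs
light_years = d * 3.26           # about 326 light years
```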

The pulsations affect both the size of the star (which expands and contracts) and the surface temperature of the star (when gases expand, they cool, and vice versa). We can measure both the apparent brightness and the temperature of the star as it expands and contracts, and that allows us to measure the angular change in the size of the star quite accurately (see “measure as angle” in the diagram below). At the same time, we can measure the linear changes in the size of the star (in meters, for example) by observing the change in the wavelengths of spectral lines emitted by the star, through the Doppler shift (the same principle that police use to detect speeders). The angular change and the linear change are related to the distance of the star, so from this information we can calculate the distance to the star (in the equations in the diagram I’ve converted to compatible units; this is just for illustration).
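A schematic of that final step, with wholly invented numbers: a linear size change (from the Doppler velocities) divided by the angular size change it subtends gives the distance.

```python
def distance_from_pulsation(delta_radius_m, delta_angle_rad):
    """Small-angle relation: a linear size change delta_R seen at
    distance d subtends an angle delta_theta = delta_R / d."""
    return delta_radius_m / delta_angle_rad

# Illustrative numbers only: a 1e9 m radius change subtending
# 1e-10 radians implies a distance of 1e19 m.
d = distance_from_pulsation(1e9, 1e-10)
```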

Our problem was fairly complicated, since the light curves of Cepheids are not nice smooth waves, but they rise slowly and fall more rapidly. We can approximate the complicated actual curve by adding up nice smooth waves of different frequencies, with appropriate amplitudes. Estimating those amplitudes is at the heart of the Bayesian program that we developed. Our results were consistent with the results of other methods.
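The “adding up nice smooth waves” idea can be sketched as below. The 1/k amplitudes (with a sign chosen so the curve rises slowly and falls rapidly) are purely illustrative, not fitted Cepheid amplitudes; estimating the real amplitudes from data is what the Bayesian program did.

```python
import math

def fourier_lightcurve(phase, amplitudes):
    """Sum of smooth sine waves at integer multiples of the pulsation
    frequency; amplitudes[k-1] multiplies sin(2*pi*k*phase)."""
    return sum(a * math.sin(2 * math.pi * (k + 1) * phase)
               for k, a in enumerate(amplitudes))

# Amplitudes falling off as 1/k give a sawtooth-like curve with a
# slow rise and a rapid fall -- illustration only.
amps = [-1.0 / k for k in range(1, 8)]
curve = [fourier_lightcurve(p / 100, amps) for p in range(100)]
```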

So, with the period-luminosity relationship calibrated, astronomers are able to measure the distances to many distant objects that contain Cepheid variables (mostly galaxies). I hasten to point out that many astronomers have worked on this problem over the decades, and our work is intended to augment what others have done.

HCOL 196, April 20, 2011

April 21, 2011

We discussed the second quiz.

The first question asked about the probability that the experimental drug is better than the standard drug. Some people thought that the 0.2, 0.3, 0.4, 0.5, 0.6 were probabilities; they are not, they are the states of nature for this problem. The probabilities for the standard and experimental drugs were the numbers in the corresponding (second and third) columns. In the chart, I have put them in parentheses to the left and above the states of nature. Since the control and experimental trials are run independently, the probability for state (r,s) is just the product of the corresponding posterior probabilities from the table. For example, for the SON (0.2, 0.3), the corresponding probabilities are 0.4 and 0.2, so the number to put in the corresponding box is 0.08. You only need the boxes above the red diagonal line, since those represent the boxes for which r>s. Then just add up those boxes and the result is 0.66. A shortcut method is to do it row by row; the answer is the same (see red numbers on the right of the picture).

There was no need (and no way) to start with prior and likelihood to get posteriors for this problem. The posteriors were already given to you. This was to save you time.

For the polling problem, the critical thing is to recognize that H on the first coin and T on the second (HT) is a different outcome from TH. The probability tree shows us that the probability of HH is equal to the probability of TT, 1/4, and the probability of a head and a tail (in either order) is 1/2. This means that we expect 25% of the answers to be “programmed” yes, corresponding to HH. Since 40% of the answers were yes, the remaining 15% were no. Similarly, 60%-25%=35% of the nos were not programmed. Of the unprogrammed answers, 30% were yes and 70% were no.
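The unprogramming arithmetic, spelled out as a short calculation:

```python
# Coin-pair polling: HH forces a "yes", TT forces a "no", and a
# head-tail mix (probability 1/2) means a truthful answer.
observed_yes = 0.40
programmed_yes = 0.25            # P(HH)
programmed_no = 0.25             # P(TT)

true_yes = observed_yes - programmed_yes        # 0.15
true_no = (1 - observed_yes) - programmed_no    # 0.35
unprogrammed = true_yes + true_no               # 0.50

frac_yes = true_yes / unprogrammed              # 0.30
frac_no = true_no / unprogrammed                # 0.70
```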

Here, we approximated the stockbroker’s claim by assuming that his actual hit rate is 80%, which is what the data said. This is obviously an approximation, since it could be some other number and still be consistent with the observed data. So the calculation is going to overestimate the probability that the stockbroker can actually do what he says. If we wanted to do better, we would put (say) half the prior probability on 65% (which says he can’t do it), and parcel out the remaining half amongst all the other possibilities. But that is too much for a 10 minute test question, hence the approximation. (Some people outlined the better scheme, and they got credit for that).

So there are two states of nature: either the broker can’t really do better than the market, or he can (represented by 80%). In the table I tentatively put 1/2 on each possibility. The likelihoods use the probabilities 0.65 (0.35) and 0.8 (0.2), raised to the power of the number of successes (failures). The calculation isn’t hard, but I gave credit if you got this far and then explained how to complete the calculation. The result is approximately 0.4 for the posterior probability that he can’t do it and 0.6 that he can. However, experienced investors know that it is really hard to beat the market, and if this were taken into account by changing the prior appropriately, it is quite likely that we would conclude that the stockbroker can’t really do what he claims.
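For concreteness, here is the calculation with an assumed record of 8 hits in 10 picks (the actual counts from the quiz are not reproduced here, so these numbers are only illustrative); it lands near the 0.4/0.6 split described above.

```python
# Assumed data for illustration only: 8 hits, 2 misses (an 80% rate).
hits, misses = 8, 2

prior_cant, prior_can = 0.5, 0.5
like_cant = 0.65 ** hits * 0.35 ** misses   # market-level performance
like_can = 0.80 ** hits * 0.20 ** misses    # the claimed 80% hit rate

joint_cant = prior_cant * like_cant
joint_can = prior_can * like_can
marginal = joint_cant + joint_can

post_cant = joint_cant / marginal   # roughly 0.37
post_can = joint_can / marginal     # roughly 0.63
```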

For Problem 4(a), I realized after class that I had miscalculated the numerical value of the likelihood for “guilty”. I had apparently raised the 0.99 to the 6th power in my calculator (which has pretty small buttons), and not the 9th power. I’ll say more in a moment.

The first step is to recognize that the juror cannot use the fact that the accused was arrested in any way to set the prior probability. In particular, you can’t say that because the grand jury indicted, the probability is at least 1/2 that he is guilty: the grand jury would have used the evidence to draw that conclusion, you are going to see the same evidence, and a Bayesian can’t use the same evidence twice. About all you know is that there are a million people in the city and that, as far as you can tell (since no evidence has been presented yet), the accused was picked at random out of the population. So the prior on guilt is 1/1,000,000. You can safely write the prior on innocence as 1 (we are going to be dividing by the marginal anyway, and 1 is very close to the implied prior 999,999/1,000,000; the difference is insignificant).

The probabilities to be used in the likelihood are 0.1 (0.9) for innocence and 0.99 (0.01) for guilt. Note that again each pair of numbers adds to 1; several people made this mistake. The corresponding numbers are raised to the 9th and 1st power, respectively, for the corresponding likelihoods.

At the end I noted that you can do the calculation of 0.99^9 by using a binomial expansion and dropping all but the first and second terms (hopefully you remember this from high school). That is shown at the bottom of the page. It was when I looked at this line after class that I realized I had miscalculated it on my calculator when I prepared for class discussion. The picture below has been corrected.
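The approximation is easy to check numerically:

```python
# Binomial expansion: (1 - x)**n is approximately 1 - n*x when x is
# small. Here (0.99)**9 is approximately 1 - 9*0.01 = 0.91.
exact = 0.99 ** 9       # about 0.9135
approx = 1 - 9 * 0.01   # 0.91
error = abs(exact - approx)
```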

Finally, for 4(b), the decision in the jury case, the crucial thing is to get a decent loss function. For the correct decisions (AI and CG), the losses are zero, since these are correct decisions. You can arbitrarily set the loss for AG equal to 1. The important thing is to set the loss for CI big enough, since as you make it smaller (relative to 1), you will increase the likelihood that you’ll convict someone who is actually innocent. For example, if you pick the loss for CI equal to 2, a posterior probability of 2/3 on guilt (and hence 1/3 on innocent) would be sufficient to convict. How would you feel if you were actually innocent, and were sent to prison on evidence that still left you with a probability of 1/3 in the jury’s eyes that you were innocent? Not very well, I think. In class, we had decided that a loss of 100 in a burglary case, for example, would be appropriate. We might choose even larger losses in a murder case for which the penalty was life in prison, since the consequences of convicting an innocent person are even larger in this case.
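The decision rule implied by this loss function is a simple threshold; a minimal sketch (the function name is mine, not from class):

```python
def should_convict(p_guilt, loss_convict_innocent, loss_acquit_guilty=1.0):
    """Convict when the expected loss of convicting is smaller than
    the expected loss of acquitting (correct decisions cost zero)."""
    expected_loss_convict = loss_convict_innocent * (1 - p_guilt)
    expected_loss_acquit = loss_acquit_guilty * p_guilt
    return expected_loss_convict < expected_loss_acquit

# With a loss of 2 for convicting the innocent, P(guilt) just above
# 2/3 already suffices to convict; with a loss of 100, you need
# P(guilt) above 100/101.
a = should_convict(0.67, 2)      # convicts
b = should_convict(0.67, 100)    # does not convict
c = should_convict(0.995, 100)   # convicts
```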

Since some people are going home early this weekend, I will discuss the last question on Monday. Friday I will talk about how we are using Bayesian methods in our astronomical research.

April 14, 2011

I wanted to expand on the polling example. The way we did it in class was frequentist. How would it be done in a Bayesian way? There are two quantities we don’t know: the number of heads thrown, and the proportion of people who did drugs. We could, for example, divide the proportion of people who did drugs into 10 or 100 discrete intervals, each represented by its center. In the drawing, I’ve done it with 100 intervals centered on 0.005, 0.015,…,0.995. That can have a uniform prior. But we also don’t know the number of heads thrown, so that is also part of the SON. For a prior on that we can assume that the coin is fair, so the appropriate distribution is binomial. See the left side of the picture (sorry, a little of it got cut off).

The “choose” function is shown in the next picture. Notice that the only thing in the prior that actually varies with n is the “choose” function. The rest of the prior on n can be ignored, because we will be dividing by the marginal and any constants will cancel.

The likelihood will be zero if the number of heads is greater than 57, since we know that at least 43 tails were thrown (otherwise we could not have gotten 43 “no” responses). The “indicator” function I(n≤57) is 1 if the thing inside is true for a particular value of n, and 0 otherwise. In particular, I(n≤57) is 1 for n=0, 1, …, 57, and 0 for n=58, 59, …, 100.

Unfortunately, the last part of the likelihood we wrote is wrong! I realized this while driving home…my apologies; I decided to discuss this at the last minute. The point is that if there are n heads, then there are 100-n tails, and we have a binomial distribution on only 100-n items, not 100 items. So, for example, if n=57 (57 heads), then there are only 100-n=43 tails and all of the tail responses must have been “no”. If n=56 then there are 44 tails, and there was one “yes” and 43 “no” responses, and so on. So the exponent on r should have been (57-n), not 57 on every line. Also, because the total number of tails changes from line to line, the “choose” function should have been included; it is not constant from line to line. It should be Choose(100-n, 43). So, I should have written the joint probability as follows:

C^{100}_n I(n \le 57) C^{100-n}_{43} r^{57-n}(1-r)^{43}

Anyway, with this correction, the joint distribution is a two-dimensional table with one entry for each combination of r and n. Just fill in the (corrected) product of prior and likelihood. Then the marginal is the sum of all the entries in the table, the posterior is gotten by dividing each entry by that marginal, and since we are interested only in the distribution on r (we don’t care about n), we get this by adding up each column and putting the column sum at the bottom of the table.
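The whole corrected computation can be sketched as a small program, using the same 100 intervals on r and the corrected likelihood:

```python
from math import comb

# States of nature: r (drug-use proportion, 100 intervals) and
# n (number of heads thrown, 0..100); data: 57 "yes", 43 "no".
rs = [(i + 0.5) / 100 for i in range(100)]
ns = range(101)

joint = {}
for n in ns:
    for r in rs:
        prior = comb(100, n)      # binomial prior on n (constants cancel)
        if n > 57:
            like = 0.0            # at least 43 tails were thrown
        else:
            like = comb(100 - n, 43) * r ** (57 - n) * (1 - r) ** 43
        joint[(n, r)] = prior * like

marginal = sum(joint.values())
# Posterior on r alone: sum each column over n, divide by the marginal.
post_r = {r: sum(joint[(n, r)] for n in ns) / marginal for r in rs}
post_mean_r = sum(r * p for r, p in post_r.items())
```

The posterior mean of r comes out in the neighborhood of the 14% figure from the simple frequentist calculation, as one would hope.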

I spent the rest of the class showing pictures of some Bayesian meetings, and I played a Bayesian song. All of the materials, and much more, can be found here. There are YouTube videos, the Bayesian Songbook (with some skits as well), pointers to pages with music, and so on. Enjoy!

HCOL 196, April 11, 2011

April 12, 2011

Today we did the remaining bullets on the study sheet.

The first bullet, where a controversial topic is being polled, is very simple. No Bayesian analysis was asked for, just a simple calculation (I may discuss a more formal calculation on Wednesday, but it won’t be on the quiz). Simply stated, we expect 50 of the 100 participants to flip a head and say “yes.” So 50 of the “yes” answers are irrelevant. Of the remaining 50 subjects, 7 said “yes” and 43 said “no.” Assuming that the subjects are sufficiently assured by the protocol that no one will be afraid to answer truthfully, this means that 7/50, or 14% of the subjects used the illegal drug in the past month.
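The arithmetic, spelled out:

```python
participants = 100
yes_answers = 57                # 50 expected "programmed" yes + 7 truthful
no_answers = 43

programmed_yes = participants // 2       # expect half to flip heads
truthful = participants - programmed_yes # 50 truthful respondents
truthful_yes = yes_answers - programmed_yes

rate = truthful_yes / truthful           # 7/50 = 0.14
```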

The second bullet is to design on paper a simple “expert system” that would, for example, allow automatic spam email detection, or allow a doctor to consult with a computer to suggest diagnoses given symptoms. In the case of spam detection, the “data” are various tokens (words like “Viagra”, or “astrophysics”) that may tend or not tend to appear in spam messages. The user of the email program informs the expert system that a message is or is not spam by clicking on an appropriate icon if the message is spam. The posterior probability is on the two states of nature, “spam” and “not spam.” In the case of the expert medical system, the data are the various symptoms being presented, and the information that the system has will probably be put in by medical experts (although there is no reason why the system cannot “learn” to become better by telling it the ultimate diagnosis when one is definitively arrived at). The posterior probability will be on the various diagnoses, which may run into the thousands.

Any Bayesian system needs a prior and a likelihood. The spam system, for example, can estimate the prior from the proportion of messages identified by the user as spam. The likelihood can be estimated by determining, again from user input in the spam/not spam messages, the proportion of spam/not spam messages that contain particular tokens. Some tokens will be associated with spam, and some will be associated with nonspam messages. So the likelihood will indeed be an approximation to P(token|spam) and P(token|nonspam) for a collection of tokens.

If there is more than one token, then the likelihood for a given email is gotten by multiplying the likelihoods for the individual tokens. One student asked whether that won’t make the posterior probability very low, since you are multiplying numbers less than 1 and the product will be even smaller. But that’s not a problem, because you are going to divide by the marginal, and it is guaranteed that the posterior probabilities for spam/nonspam will add to 1. I asked what assumption is being made here, and another student responded that if one just multiplies the individual likelihoods, one is assuming independence. That is quite correct. The calculation does not take into account that some tokens may be dependent on others. Nonetheless, it is remarkable that these “naive Bayes classifiers” are actually quite robust, even though they ignore the dependence issue.
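A minimal naive Bayes sketch, with made-up message and token counts (everything below is purely illustrative): estimate the prior from the fraction of messages marked spam, the per-token likelihoods from how often each token appeared in each class, multiply, and divide by the marginal.

```python
import math

# Invented training counts: messages marked spam/not spam, and
# counts of messages containing each token, by class.
spam_msgs, ham_msgs = 40, 60
token_counts = {                 # token -> (count in spam, count in ham)
    "viagra": (30, 1),
    "astrophysics": (1, 20),
    "meeting": (5, 25),
}

def posterior_spam(tokens):
    """Multiply per-token likelihoods (the independence assumption),
    then normalize so the spam/nonspam posteriors add to 1."""
    log_spam = math.log(spam_msgs / (spam_msgs + ham_msgs))
    log_ham = math.log(ham_msgs / (spam_msgs + ham_msgs))
    for t in tokens:
        s, h = token_counts[t]
        # Add-one smoothing so an unseen token doesn't zero everything out.
        log_spam += math.log((s + 1) / (spam_msgs + 2))
        log_ham += math.log((h + 1) / (ham_msgs + 2))
    spam = math.exp(log_spam)
    ham = math.exp(log_ham)
    return spam / (spam + ham)   # divide by the marginal

p1 = posterior_spam(["viagra"])        # high posterior probability of spam
p2 = posterior_spam(["astrophysics"])  # low posterior probability of spam
```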

The spam program and the diagnosis program “learn” when the expert (the user of the email, or the physician) enters the final decision into the system. When the email user identifies a particular message as “spam” (or by default, “not spam”), the prior on “spam” can be updated, and also the likelihood for the tokens in the message can be updated. So, in time, the system will perform better.

The Wikipedia article on Bayesian spam detection is pretty good.

Similarly, the diagnosis program will “learn” about the frequency of various diagnoses when the physician inputs the final diagnosis into the system, and it will (just as in the spam program) be able to update the likelihood for various symptoms given various diagnoses from that information.

Everyone seemed comfortable about their ability to draw decision trees as described by the last bullet.

Interesting article

April 11, 2011

Here is an interesting article about what to do about close elections. Even a recount can be inaccurate!