Archive for April, 2011

HCOL 196, April 29, 2011

April 29, 2011

Again, I thank all of you for your work on your presentations. Today’s presentations, on the Prosecutor’s Fallacy and Human Medical Experimentation, were both thought-provoking.

Monday I plan to discuss the presentations, and I will give each of you your current grade (assuming that you don’t do anything more).

Meantime, check out this cartoon. I think you will be amused.

HCOL 196, April 27, 2011

April 28, 2011

Thanks to our two teams who presented their findings about Pascal’s Wager and the decisions about legalization of drugs.

See you all on Friday. Please be prompt. The amount of time is limited.

HCOL 196, April 25, 2011

April 26, 2011

Reminder: On Wednesday and Friday, when you do your presentations, it is very important to be prompt to class as we have a limited amount of time.

On Monday I talked about another astronomical problem that we have used Bayesian methods to solve. Like the problem we discussed last time, this one involves a special kind of variable star, this time called RR Lyrae stars (again, named for the first star of this type to be discovered). These are fainter than the Cepheids, but they are all known to have very nearly the same luminosity. So again, if we can calibrate that luminosity, we can use these stars to measure the distances of objects that contain them. Also as before, the nearest of these stars is a little too far away to measure its distance directly; we were only able to do this with the Hubble, and we would like an independent check.

This time we can’t use the method that we did for the Cepheids, but we know that these stars have low velocities relative to the galaxy (and thus high velocities relative to the Sun, which is moving rapidly around the galaxy). To a first approximation, consider four such stars, located at 12:00, 3:00, 6:00 and 9:00 on a clock face relative to the Sun (see picture). The stars at 12:00 and 6:00 can have their velocities towards or away from the Sun measured by Doppler shift, just as we measured the velocity of the pulsations of the Cepheid variable. These velocities don’t depend on how far away the star is. The stars at 3:00 and 9:00 aren’t moving towards or away from the Sun; they appear to fly by us, as trees at the side of the road would if we were driving down the highway. They have an angular velocity, and that angular velocity depends on their distance. We can measure these angular velocities by observing a star at two times, noting the change in angle, and dividing by the time interval. Then, just as in the Cepheid case, we have an angular velocity and a linear velocity (on two different stars), and this allows us to calculate the distance of the stars at 3:00 and 9:00, and thus calibrate the brightness of these stars (and by extension, all RR Lyrae stars).
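To make the geometry concrete, here is a minimal sketch of the arithmetic, with invented numbers (not real measurements): assume the 3:00 star’s linear speed across our line of sight equals the Doppler speed measured for the 12:00 star, and divide by its angular velocity.

```python
import math

# Invented numbers, for illustration only (not actual measurements):
v_linear = 120.0        # km/s, Doppler velocity measured for the 12:00 (or 6:00) star
mu = 0.1                # arcsec/year, angular velocity measured for the 3:00 (or 9:00) star

SEC_PER_YEAR = 365.25 * 24 * 3600
KM_PER_PC = 3.0857e13
ARCSEC_TO_RAD = math.pi / (180 * 3600)

mu_rad_per_s = mu * ARCSEC_TO_RAD / SEC_PER_YEAR   # angular velocity in radians per second
distance_km = v_linear / mu_rad_per_s              # distance = linear velocity / angular velocity
print(f"Distance ≈ {distance_km / KM_PER_PC:.0f} parsecs")
```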

It’s a bit more complicated than this, of course. In the first place, we aren’t going to find stars at exactly those points, so some geometric modeling is needed. Also, the stars do not have exactly zero velocities relative to the galaxy; they have their own motions, so we have to model that statistically as well. This is easily done using the Bayesian method.

I noted that in general we cannot use the “spreadsheet” methods of the class to solve problems like this. Although the basic ideas remain the same, if you have more than two parameters of interest, the approach rapidly becomes impractical. For the past quarter century, statisticians have instead been using various strategies to draw a large sample from the posterior distribution and then make inferences from that sample. But we start at the same basic place. With x representing the data and \theta the states of nature, we start with a prior on \theta and a likelihood, and multiply them to get the joint. The problem is that to compute a posterior distribution we need to divide the joint by the marginal, and the marginal is in general computed by integration (unless the parameters are discrete, so that we can sum). But integration is hard. So statisticians have learned how to avoid that step altogether by drawing a sample from the posterior directly.

The basic technique is called Markov chain Monte Carlo (I didn’t name it in class). Markov chains are a particular type of mathematical object, and Monte Carlo refers to the famous casino on the Mediterranean coast. A simple example is shown in the diagram, where we are trying to draw a sample from a normal (bell-shaped) distribution. We start anywhere, with some value \theta_0. Starting there, we generate a new \theta^* from a uniform distribution between \theta_0-d and \theta_0+d. We then compute the ratio of the joint distribution at the new point to that at the old point. We generate a uniform random number between 0 and 1, and compare it to the ratio. If it is less than the ratio, we jump to the new value, so \theta_1=\theta^*, otherwise we set \theta_1=\theta_0. We then repeat the procedure as long as we wish, obtaining a large sample.
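Here is a minimal sketch of that recipe in Python. The target (a standard normal) and the step size d are just illustrative choices:

```python
import random, math

def log_joint(theta):
    # Example target: a standard normal (bell-shaped) distribution, up to a constant.
    return -0.5 * theta ** 2

def metropolis(n_samples, theta0=0.0, d=1.0):
    """Random-walk Metropolis: propose uniformly in [theta-d, theta+d],
    then jump if a uniform random number is less than the ratio of joints."""
    samples = []
    theta = theta0
    for _ in range(n_samples):
        theta_star = theta + random.uniform(-d, d)            # proposed new value
        ratio = math.exp(log_joint(theta_star) - log_joint(theta))
        if random.random() < ratio:                            # jump with probability min(1, ratio)
            theta = theta_star
        samples.append(theta)                                  # otherwise keep the old value
    return samples

draws = metropolis(50_000)
print(sum(draws) / len(draws))   # should be close to 0, the mean of the target
```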

If you think about it, you can easily see that if the distribution at the new point is bigger, you will definitely move there, but if it is less, you will sometimes move there and sometimes not. This means that the points will concentrate near the peak of the curve, and in fact the sample will be a very good approximation to a sample from a normal distribution.

This is a very simple example. There is an art to designing programs that will sample efficiently, but it’s not hard to design programs that will sample effectively, given enough time.

There are many variations on these schemes. Usually it is easiest to sample on a small number of parameters at a time. For example, as in the following figure, you can sample first on \theta_1, then on \theta_2, and then repeat.
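A sketch of that alternating scheme, again on a toy target (two correlated normal parameters, chosen only for illustration):

```python
import random, math

def log_joint(t1, t2):
    # Example: two correlated normal parameters (correlation 0.8), up to a constant.
    rho = 0.8
    return -(t1**2 - 2*rho*t1*t2 + t2**2) / (2 * (1 - rho**2))

def metropolis_step(current, log_density, d=1.0):
    """One Metropolis update of a single parameter, holding the others fixed."""
    proposal = current + random.uniform(-d, d)
    if random.random() < math.exp(log_density(proposal) - log_density(current)):
        return proposal
    return current

t1, t2 = 0.0, 0.0
samples = []
for _ in range(50_000):
    t1 = metropolis_step(t1, lambda x: log_joint(x, t2))   # sample on theta_1 first...
    t2 = metropolis_step(t2, lambda y: log_joint(t1, y))   # ...then on theta_2, and repeat
    samples.append((t1, t2))
```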

The method was first developed at Los Alamos National Laboratory in the 1950s, where it was used in the development of nuclear weapons.

HCOL 196, April 22, 2011

April 24, 2011

We postponed the presentations until Wednesday the 27th and Friday the 29th, so that we would not interfere with people’s holiday plans. Each team knows when they are scheduled to present. It is important for people to arrive on time for presentations, because the time available is limited. Please, everyone, make every effort to be on time.

On Friday I talked about work that I’ve done using Bayesian methods. I talked about a particular kind of star, Cepheid variables, which are very important in astronomy. These stars are fairly bright, and they pulsate regularly.

In 1908, Henrietta Leavitt discovered that the intrinsic luminosity of these stars depends on their pulsation periods: The longer the period, the more luminous the star is. This is called the period-luminosity relationship. Of course, as the star gets farther away, it will become fainter to us, but the intrinsic luminosity doesn’t depend on the distance of the star. This gives us a way to measure the distance to one of these stars, if we know its intrinsic luminosity: Measure its apparent luminosity, which decreases as the square of the distance to the star (the “inverse square law”), and calculate the distance from the intrinsic and apparent luminosities.
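In symbols, with L the intrinsic luminosity and F the apparent luminosity (flux) that we measure, the inverse square law says

F = \frac{L}{4\pi d^{2}}, \qquad\text{so}\qquad d = \sqrt{\frac{L}{4\pi F}}

The period gives L (once the relationship is calibrated), the telescope gives F, and together they give the distance d.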

To do this, we need to calibrate the period-luminosity relationship. That is, even though we know from Leavitt’s work that there is a relationship, we do not have the tools to convert a period into an intrinsic luminosity. Leavitt was looking at these stars in objects that contain a number of them (the nearby galaxies known as the Magellanic Clouds), so all the stars she studied were at about the same distance from us. Since they were all at the same distance, the period-luminosity relationship showed up as a relationship between the periods and the apparent luminosities of these stars. But no one knew the distance to the Magellanic Clouds, so a crucial piece of information was missing: the distance to at least one Cepheid variable.

The direct method of measuring the distance to a star is to observe it as the Earth goes around the Sun. Because of the Earth’s motion, the direction to the star changes slightly. If we can measure this very small change in angle, we can determine the distance to the star by trigonometry, as shown in the figure:

The trouble is that the angles are very small, less than 1 second of arc even for the nearest star (Proxima Centauri). When we started our work on the Hubble, the errors of parallaxes were about ±0.01 second of arc, which means that we could not measure distances to stars more than about 100 parsecs, or 326 light years away. Unfortunately, the nearest Cepheid is almost three times farther away than that, so another method was needed. (Since then, we have measured the distance to this star to quite high precision, but we’d like to have other, independent measures). But, as you can see from the picture of our galaxy, most stars are just too far away.
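For reference, the trigonometry boils down to a very simple relation when the parallax angle p is measured in seconds of arc and the distance d in parsecs (that is how the parsec is defined):

d = \frac{1}{p}

A star at 100 parsecs has p = 0.01 arcsec, the same size as the quoted measurement error, which is why roughly 100 parsecs was the practical limit.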

The pulsations affect both the size of the star (which expands and contracts) and the surface temperature of the star (when gases expand, they cool, and vice versa). We can measure both the apparent brightness and the temperature of the star as it expands and contracts, and that allows us to measure the angular change in the size of the star quite accurately (see “measure as angle” in the diagram below). At the same time, we can measure the linear change in the size of the star (in meters, for example) by observing the change in the wavelengths of spectral lines emitted by the star, through the Doppler shift (the same principle that police use to detect speeders). The angular change and the linear change are both related to the distance of the star, so from this information we can calculate that distance (in the equations in the diagram I’ve converted to compatible units; this is just for illustration).
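Here is a stripped-down sketch of that idea with invented numbers (a real analysis must also handle projection factors and the light-curve modeling described below):

```python
# Invented numbers, for illustration only:
mean_expansion_speed = 10.0            # km/s, average pulsation velocity from the Doppler shifts
duration_s = 5.0 * 24 * 3600           # seconds over which we watched the star expand
delta_theta_rad = 2.4e-10              # measured change in the star's angular radius, in radians

delta_R_km = mean_expansion_speed * duration_s             # linear change in radius over that interval
KM_PER_PC = 3.0857e13
distance_pc = (delta_R_km / delta_theta_rad) / KM_PER_PC   # angular change = linear change / distance
print(f"Distance ≈ {distance_pc:.0f} parsecs")
```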

Our problem was fairly complicated, since the light curves of Cepheids are not nice smooth waves, but they rise slowly and fall more rapidly. We can approximate the complicated actual curve by adding up nice smooth waves of different frequencies, with appropriate amplitudes. Estimating those amplitudes is at the heart of the Bayesian program that we developed. Our results were consistent with the results of other methods.
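Here is a sketch of the “sum of smooth waves” representation, with the amplitudes estimated by ordinary least squares on an invented light curve. This only illustrates the representation; it is not our actual Bayesian program.

```python
import numpy as np

# Invented light curve: rises slowly, falls more quickly, with a little noise (not real data).
period = 5.4                                   # days
t = np.linspace(0.0, period, 60)
phase = 2 * np.pi * t / period
brightness = 1.0 + 0.4 * np.sin(phase) + 0.15 * np.sin(2 * phase + 0.8) \
             + 0.02 * np.random.randn(t.size)

# Design matrix: a constant plus sine and cosine waves at harmonics of the period.
n_harmonics = 3
columns = [np.ones_like(t)]
for k in range(1, n_harmonics + 1):
    columns += [np.cos(k * phase), np.sin(k * phase)]
X = np.column_stack(columns)

# Least-squares estimates of the amplitudes of each smooth wave.
amplitudes, *_ = np.linalg.lstsq(X, brightness, rcond=None)
fitted_curve = X @ amplitudes
print(np.round(amplitudes, 3))
```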

So, with the period-luminosity relationship calibrated, astronomers are able to measure the distances to many distant objects that contain Cepheid variables (mostly galaxies). I hasten to point out that many astronomers have worked on this problem over the decades, and our work is intended to augment what others have done.

HCOL 196, April 20, 2011

April 21, 2011

We discussed the second quiz.

The first question asked about the probability that the experimental drug is better than the standard drug. Some people thought that the 0.2, 0.3, 0.4, 0.5, 0.6 were probabilities; they are not, they are the states of nature for this problem. The probabilities for the standard and experimental drugs were the numbers in the corresponding (second and third) columns. In the chart, I have put them in parentheses to the left of and above the states of nature. Since the control and experimental trials are independent, the probability for state (r,s) is just the product of the corresponding posterior probabilities from the table. For example, for the SON (0.2, 0.3), the corresponding probabilities are 0.4 and 0.2, so the number to put in the corresponding box is 0.08. You only need the boxes above the red diagonal line, since those represent the boxes for which r>s. Then just add up those boxes; the result is 0.66. A shortcut is to do it row by row; the answer is the same (see red numbers on the right of the picture).
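In code, the box-filling and summing looks like the sketch below. The posterior tables here are placeholders (only a couple of the real entries appear in this post), so the printed number will not be the 0.66 from the quiz; the point is the structure of the calculation.

```python
# Placeholder posterior tables (the quiz sheet had the actual numbers):
son = [0.2, 0.3, 0.4, 0.5, 0.6]                   # states of nature: possible cure rates
post_standard     = [0.4, 0.3, 0.2, 0.08, 0.02]   # P(cure rate | data) for the standard drug
post_experimental = [0.1, 0.2, 0.3, 0.25, 0.15]   # P(cure rate | data) for the experimental drug

# The two trials are independent, so each box is a product of the two marginal posteriors.
# Add up the boxes where the experimental rate exceeds the standard rate.
prob_experimental_better = sum(
    post_standard[i] * post_experimental[j]
    for i, r_std in enumerate(son)
    for j, r_exp in enumerate(son)
    if r_exp > r_std
)
print(round(prob_experimental_better, 3))
```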

There was no need (and no way) to start with prior and likelihood to get posteriors for this problem. The posteriors were already given to you. This was to save you time.

For the polling problem, the critical thing is to recognize that H on the first coin and T on the second (HT) is a different outcome from TH. The probability tree shows us that the probability of HH equals the probability of TT, 1/4, and the probability of a head and a tail (in either order) is 1/2. This means that we expect 25% of the answers to be “programmed” yes answers, corresponding to HH, and 25% to be programmed no answers, corresponding to TT. Since 40% of the answers were yes, the remaining 40%-25%=15% were unprogrammed yes answers. Similarly, 60%-25%=35% of the answers were unprogrammed no answers. Of the unprogrammed answers, 30% were yes and 70% were no.
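The same arithmetic in a few lines, as I read the protocol (HH forces a “yes”, TT forces a “no”, and a mixed flip is answered truthfully):

```python
observed_yes = 0.40          # fraction of all answers that were "yes"

forced_yes = 0.25            # P(HH): programmed "yes"
forced_no = 0.25             # P(TT): programmed "no"
truthful_fraction = 0.50     # P(HT or TH): answered truthfully

truthful_yes = observed_yes - forced_yes           # 0.15 of all answers
truthful_no = (1 - observed_yes) - forced_no       # 0.35 of all answers

true_yes_rate = truthful_yes / truthful_fraction   # 0.30
true_no_rate = truthful_no / truthful_fraction     # 0.70
print(true_yes_rate, true_no_rate)
```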

Here, we approximated the stockbroker’s claim by assuming that his actual hit rate is 80%, which is what the data said. This is obviously an approximation, since it could be some other number and still be consistent with the observed data. So the calculation is going to overestimate the probability that the stockbroker can actually do what he says. If we wanted to do better, we would put (say) half the prior probability on 65% (which says he can’t do it), and parcel out the remaining half amongst all the other possibilities. But that is too much for a 10 minute test question, hence the approximation. (Some people outlined the better scheme, and they got credit for that).

So the states of nature are two: either the broker can’t really do better than the market, or he can (represented by 80%). In the table I tentatively put 1/2 on each possibility. The likelihoods use the probabilities 0.65 (0.35) and 0.8 (0.2), raised to the power of the number of successes (failures). The calculation isn’t hard, but I gave credit if you got this far and then explained how to complete the calculation. The result is approximately 0.4 for the posterior probability that he can’t do it and 0.6 that he can. However, experienced investors know that it is really hard to beat the market, and if this were taken into account by changing the prior appropriately, it is quite likely that we would conclude that the stockbroker can’t really do what he claims.
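A sketch of the spreadsheet, with invented counts (say 8 hits in 10 picks; the quiz sheet has the actual data):

```python
# Invented data for illustration: 8 successes and 2 failures.
successes, failures = 8, 2

prior = {"can't beat the market": 0.5, "can (hit rate 80%)": 0.5}
hit_rate = {"can't beat the market": 0.65, "can (hit rate 80%)": 0.80}

# Joint = prior x likelihood; likelihood = rate^successes * (1 - rate)^failures.
joint = {son: prior[son] * hit_rate[son] ** successes * (1 - hit_rate[son]) ** failures
         for son in prior}
marginal = sum(joint.values())
posterior = {son: joint[son] / marginal for son in joint}
print(posterior)
```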

For Problem 4(a), I realized after class that I had miscalculated the numerical value of the likelihood for “guilty”. I had apparently raised the 0.99 to the 6th power in my calculator (which has pretty small buttons), and not the 9th power. I’ll say more in a moment.

The first step is to recognize that the juror cannot use the fact that the accused was arrested in any way to set the prior probability. In particular, you can’t say that the grand jury indicted, so the probability is at least 1/2 that he is guilty, because the grand jury would have used the evidence to draw that conclusion, you are going to see the same evidence, and a Bayesian can’t use the same evidence twice. About all you know is that there are a million people in the city and that, as far as you know (since no evidence has been presented yet), the accused was picked at random out of that population. So the prior on guilt is 1/1,000,000. You can safely write the prior on innocence as 1 (we are going to be dividing by the marginal anyway, and 1 is very close to the implied prior of 999,999/1,000,000; the difference is insignificant).

The probabilities to be used in the likelihood are 0.1 (0.9) for innocence and 0.99 (0.01) for guilt. Note that, again, each pair of numbers adds to 1; several people made the mistake of using pairs that do not. The corresponding numbers are raised to the 9th and 1st powers, respectively, for the corresponding likelihoods.

At the end I noted that you can do the calculation of 0.99^9 by using a binomial expansion and dropping all but the first and second terms (hopefully you remember this from high school). That is shown at the bottom of the page. It was when I looked at this line after class that I realized I had miscalculated it on my calculator when I prepared for class discussion. The picture below has been corrected.
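For completeness, here is the arithmetic in code, using my reading of the numbers above (the innocence probabilities raised to the 9th and 1st powers, and likewise for guilt):

```python
# Prior: the accused is one person out of a city of a million.
prior_guilty = 1e-6
prior_innocent = 1.0            # close enough to 999,999/1,000,000, as noted above

# Likelihoods, as described above: 0.1^9 * 0.9 for innocence, 0.99^9 * 0.01 for guilt.
like_guilty   = 0.99 ** 9 * 0.01
like_innocent = 0.10 ** 9 * 0.90

joint_guilty, joint_innocent = prior_guilty * like_guilty, prior_innocent * like_innocent
posterior_guilty = joint_guilty / (joint_guilty + joint_innocent)
print(round(posterior_guilty, 3))
```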

Finally, for 4(b), the decision in the jury case, the crucial thing is to get a decent loss function. For the correct decisions (AI and CG), the losses are zero, since these are correct decisions. You can arbitrarily set the loss for AG equal to 1. The important thing is to set the loss for CI big enough, since as you make it smaller (relative to 1), you will increase the likelihood that you’ll convict someone who is actually innocent. For example, if you pick the loss for CI equal to 2, a posterior probability of 2/3 on guilt (and hence 1/3 on innocent) would be sufficient to convict. How would you feel if you were actually innocent, and were sent to prison on evidence that still left you with a probability of 1/3 in the jury’s eyes that you were innocent? Not very well, I think. In class, we had decided that a loss of 100 in a burglary case, for example, would be appropriate. We might choose even larger losses in a murder case for which the penalty was life in prison, since the consequences of convicting an innocent person are even larger in this case.
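To see where the 2/3 comes from: with losses 0 for the correct decisions, 1 for AG, a loss of L for CI, and a posterior probability p of guilt, the expected losses of the two actions are

E[\text{loss}\mid\text{acquit}] = p, \qquad E[\text{loss}\mid\text{convict}] = (1-p)\,L

so convicting has the smaller expected loss exactly when p > L/(1+L). For L = 2 that threshold is 2/3; for L = 100 it is 100/101, about 0.99, which is much closer to “beyond a reasonable doubt.”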

Since some people are going home early this weekend, I will discuss the last question on Monday. Friday I will talk about how we are using Bayesian methods in our astronomical research.

April 14, 2011

I wanted to expand on the polling example. The way we did it in class, it was frequentist. How would it be done in a Bayesian way? There are two quantities we don’t know: the number of heads flipped, and the proportion of people who did drugs. We could, for example, divide the proportion of people who did drugs into 10 or 100 discrete intervals, each represented by the center of the interval. In the drawing, I’ve done it with 100 intervals centered on 0.005, 0.015,…,0.995. That can have a uniform prior. But we also don’t know the number of heads thrown, so that is also part of the SON. For a prior on that we can assume that the coin is fair, and the appropriate distribution is binomial. See the left side of the picture (sorry, a little of it got cut off).


The “choose” function is shown in the next picture. Notice that the only thing in the prior that actually varies with n is the “choose” function. The rest of the prior on n can be ignored, because we will be dividing by the marginal and any constants will cancel.

The likelihood will be zero if the number of heads is greater than 57, since we know that at least 43 tails were thrown (otherwise we could not have gotten 43 “no” responses). The “indicator” function I(n≤57) is 1 if the thing inside is true for a particular value of n, and 0 otherwise. In particular, I(n≤57) is 1 for n=0, 1, …, 57, and 0 for n=58, 59, …, 100.

Unfortunately, the last part of the likelihood we wrote is wrong! I realized this while driving home…my apologies; I decided to discuss this at the last minute. The thing is that if there are n heads, then there are 100-n tails, and we only have a binomial distribution on 100-n items, not 100 items. So, for example, if n=57 (57 heads), then there are only 100-n=43 tails, and all of the tail responses must have been “no”. If n=56 then there are 44 tails, and there was one “yes” and 43 “no” responses, and so on. So the exponent on r should have been (57-n), not 57, for each line. Also, because the total number of tails changes from line to line, the “choose” function should have been included; it is not constant from line to line. It should be Choose(100-n, 43). So, I should have written the joint probability as follows:

C^{100}_n I(n \le 57) C^{100-n}_{43} r^{57-n}(1-r)^{43}

Anyway, with this correction, the joint distribution is a two-dimensional table with one entry for each combination of r and n. Just fill in the (corrected) product of prior and likelihood. Then the marginal is the sum of all the entries in the table, the posterior is gotten by dividing each entry by that marginal, and since we are interested only in the distribution on r (we don’t care about n), we get this by adding up each column and putting the column sum at the bottom of the table.
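Here is a sketch of the whole corrected calculation in code: build the grid, sum for the marginal, divide, and then add down the n direction to get the posterior on r alone. The printed posterior mean should land in the same neighborhood as the 14% we got by the simple method in class.

```python
from math import comb

# States of nature: r (proportion who used the drug) on a grid of 100 values,
# and n (number of heads among the 100 coin flips).  Data: 57 yes, 43 no.
r_grid = [(i + 0.5) / 100 for i in range(100)]        # 0.005, 0.015, ..., 0.995
yes, no = 57, 43

# Joint = prior(n) * likelihood, as in the corrected formula above.
# The uniform prior on r and the (1/2)^100 in the prior on n are constants and cancel.
joint = {}
for n in range(101):
    if n > yes:                                       # the indicator I(n <= 57)
        continue
    for r in r_grid:
        joint[(r, n)] = comb(100, n) * comb(100 - n, no) * r ** (yes - n) * (1 - r) ** no

marginal = sum(joint.values())
posterior = {key: value / marginal for key, value in joint.items()}

# We only care about r, so add down the n direction (the "column sums").
posterior_r = {r: sum(posterior[(r, n)] for n in range(yes + 1)) for r in r_grid}
posterior_mean_r = sum(r * p for r, p in posterior_r.items())
print(round(posterior_mean_r, 3))
```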

I spent the rest of the class showing pictures of some Bayesian meetings, and I played a Bayesian song. All of the materials, and much more, can be found here. There are YouTube videos, the Bayesian Songbook (with some skits as well), pointers to pages with music, and so on. Enjoy!

HCOL 196, April 11, 2011

April 12, 2011

Today we did the remaining bullets on the study sheet.

The first bullet, where a controversial topic is being polled, is very simple. No Bayesian analysis was asked for, just a simple calculation (I may discuss a more formal calculation on Wednesday, but it won’t be on the quiz). Simply stated, we expect 50 of the 100 participants to flip a head and say “yes.” So 50 of the “yes” answers are irrelevant. Of the remaining 50 subjects, 7 said “yes” and 43 said “no.” Assuming that the subjects are sufficiently assured by the protocol that no one will be afraid to answer truthfully, this means that 7/50, or 14% of the subjects used the illegal drug in the past month.

The second bullet is to design, on paper, a simple “expert system” that would, for example, allow automatic spam email detection, or allow a doctor to consult with a computer that suggests diagnoses given symptoms. In the case of spam detection, the “data” are various tokens (words like “Viagra” or “astrophysics”) that tend, or do not tend, to appear in spam messages. The user of the email program informs the expert system whether a message is or is not spam by clicking on an appropriate icon. The posterior probability is on the two states of nature, “spam” and “not spam.” In the case of the expert medical system, the data are the various symptoms being presented, and the information that the system has will probably be put in by medical experts (although there is no reason why the system cannot “learn” to become better by being told the ultimate diagnosis when one is definitively arrived at). The posterior probability will be on the various diagnoses, which may run into the thousands.

Any Bayesian system needs a prior and a likelihood. The spam system, for example, can estimate the prior from the proportion of messages identified by the user as spam. The likelihood can be estimated by determining, again from user input in the spam/not spam messages, the proportion of spam/not spam messages that contain particular tokens. Some tokens will be associated with spam, and some will be associated with nonspam messages. So the likelihood will indeed be an approximation to P(token|spam) and P(token|nonspam) for a collection of tokens.

If there is more than one token, then the likelihood for a given email is gotten by multiplying the likelihoods for the individual tokens. One student asked whether that won’t make the posterior probability very low, since you are multiplying numbers less than 1 and the product will be even smaller. But that’s not a problem, because you are going to divide by the marginal, and it is guaranteed that the posterior probabilities for spam/nonspam will add to 1. I asked what assumption is being made here, and another student responded that if one just multiplies the individual likelihoods, one is assuming independence. That is quite correct. The calculation does not take into account that some tokens may be dependent on others. Nonetheless, it is remarkable that these “naive Bayes classifiers” are actually quite robust, even though they ignore the dependence issue.
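A toy version of such a naive Bayes spam filter is sketched below. The training messages and the +1/+2 smoothing are invented for illustration; a real filter would train on thousands of messages.

```python
from collections import defaultdict

# Tiny invented training set of labeled messages.
training = [
    ("viagra cheap offer", "spam"),
    ("cheap viagra now", "spam"),
    ("astrophysics seminar today", "notspam"),
    ("seminar on bayesian astrophysics", "notspam"),
]

counts = {"spam": defaultdict(int), "notspam": defaultdict(int)}
n_messages = {"spam": 0, "notspam": 0}
for text, label in training:
    n_messages[label] += 1
    for token in set(text.split()):
        counts[label][token] += 1

def posterior_spam(message):
    total = sum(n_messages.values())
    post = {}
    for label in ("spam", "notspam"):
        p = n_messages[label] / total                      # prior from the labeled messages
        for token in set(message.split()):
            # P(token | label), with +1/+2 smoothing so unseen tokens don't give zero
            p *= (counts[label][token] + 1) / (n_messages[label] + 2)
        post[label] = p                                    # prior x product of token likelihoods
    marginal = post["spam"] + post["notspam"]              # divide by the marginal...
    return post["spam"] / marginal                         # ...so the two posteriors add to 1

print(round(posterior_spam("cheap astrophysics offer"), 3))
```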

The spam program and the diagnosis program “learn” when the expert (the user of the email, or the physician) enters the final decision into the system. When the email user identifies a particular message as “spam” (or by default, “not spam”), the prior on “spam” can be updated, and also the likelihood for the tokens in the message can be updated. So, in time, the system will perform better.

The Wikipedia article on Bayesian spam detection is pretty good.

Similarly, the diagnosis program will “learn” about the frequency of various diagnoses when the physician inputs the final diagnosis into the system, and it will (just as in the spam program) be able to update the likelihood for various symptoms given various diagnoses from that information.

Everyone seemed comfortable about their ability to draw decision trees as described by the last bullet.

Interesting article

April 11, 2011

Here is an interesting article about what to do about close elections. Even a recount can be inaccurate!

HCOL 196, April 8, 2011

April 9, 2011

Just to remind you, in many questions it would be impossible for you to do a complete calculation in 10 minutes. In such situations I want you to convince me that you know how to answer the question in lieu of actually doing the calculation.

The first bullet on the second side of the study sheet was the “tagged/untagged” fish survey. This is a “sampling without replacement” scenario, since we do not toss the fish back in the second part of the experiment (we don’t want to catch the same fish twice!). One student wondered whether the likelihood depends on the order in which the tagged/untagged fish are caught. It doesn’t, because the denominator (in the example, for N=15) is always 15×14×13×12×…×6, no matter what the order; if you look at the tagged (T) fish, the numerator for those fish is always 10×9×8×7×6, and for the untagged fish it is always 5×4×3×2×1, regardless of the order. So numerator and denominator are always the same.

For larger N, the denominator starts at N instead of 15, and the numerator for the untagged fish starts at (N-10) instead of 5. You can see that for N=15, we get exactly what’s in the picture below.

For a prior, if you don’t have information about the size of the lake or the number of fish in it, it is probably safe to choose a power-law prior, since it’s more likely that the lake is small and has few fish, than that it is large and has many fish. A power-law prior is of the form shown in the following chart. Generally, the exponent a will be small, of order 1 or 2.

The likelihood is 0 for N<15 since we know that there are at least 15 fish in the lake. The following chart shows how one might set up the spreadsheet for power law priors with a=1 and a=2 (you have to pick one). A uniform prior is actually a power-law prior with a=0! All of these priors are defensible; I would expect you to give your reasons for picking the one you do.
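Putting the pieces together, here is a sketch of the whole spreadsheet in code for the numbers in the example (10 tagged fish in the lake, a second catch containing 5 tagged and 5 untagged), with a power-law prior whose exponent you would pick and defend. The truncation of the sum at a large N is also a choice you would justify.

```python
# Capture-recapture: 10 fish were tagged; a later catch of 10 fish contained
# 5 tagged and 5 untagged.  N is the unknown number of fish in the lake.
TAGGED_IN_LAKE, CAUGHT_TAGGED, CAUGHT_UNTAGGED = 10, 5, 5

def likelihood(N):
    if N < TAGGED_IN_LAKE + CAUGHT_UNTAGGED:          # fewer than 15 fish is impossible
        return 0.0
    num = 1.0
    for i in range(CAUGHT_TAGGED):                    # tagged draws: 10*9*8*7*6
        num *= TAGGED_IN_LAKE - i
    for i in range(CAUGHT_UNTAGGED):                  # untagged draws: (N-10)*(N-11)*...
        num *= N - TAGGED_IN_LAKE - i
    den = 1.0
    for i in range(CAUGHT_TAGGED + CAUGHT_UNTAGGED):  # denominator: N*(N-1)*...*(N-9)
        den *= N - i
    return num / den

a = 1                                                 # power-law exponent (1, 2, or 0 for uniform)
N_values = range(1, 2001)                             # a large enough range of N to sum over
joint = [N ** (-a) * likelihood(N) for N in N_values]
marginal = sum(joint)
posterior = [j / marginal for j in joint]
print(sum(N * p for N, p in zip(N_values, posterior)))   # posterior mean of N
```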

The next bullet was couched in terms of a self-proclaimed psychic, but in class I noted that my intention was to distinguish between an exact null hypothesis of 0.85 (corresponding to a fair die) and a possibly biased die that could have any (unknown) bias. The kind of die in question would be, for example, a 20-sided “dungeons and dragons” die with 3 sides marked one way (success) and 17 marked the other way (failure).

Given states of nature Fair and Biased (F,B), the spreadsheet is very simple. It’s shown below. For the SON Fair, the likelihood P(D|F) is also easy to calculate and is shown at the bottom of the picture.

But for the biased die, there are 10 values of the bias, and these are also states of nature. We can calculate P(D|B) with a spreadsheet, where each possible bias is a SON. However, we have to be careful. We cannot be lazy and just put 1 for each prior, because we aren’t going to divide by the marginal at the end, which in most of our calculations cancels any arbitrary factor from the prior. We have to make sure that the prior adds up to 1. The spreadsheet for this (assuming a uniform prior, which you should defend) assigns prior probability 0.1 to each of the values of bias we are considering. We then compute the likelihood (shown in the spreadsheet for several values of bias), the joint, and the marginal in the usual way. The marginal is what we need, since it is the value of P(D|B) that we want to put into the spreadsheet above. That allows us to finish that spreadsheet and get the posterior probabilities P(F|D), P(B|D) that we need to answer the question.
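A sketch of the two nested spreadsheets in code. The data counts and the particular 10-point grid of bias values are invented for illustration; the bullet on the study sheet has the actual numbers.

```python
# Invented data: say the "0.85" outcome came up 70 times in 100 rolls.
hits, n = 70, 100
misses = n - hits

p_fair = 0.85                                    # the exact null hypothesis
bias_grid = [0.05 + 0.1 * i for i in range(10)]  # 10 possible bias values (my choice of grid)

def binom_like(p):
    return p ** hits * (1 - p) ** misses         # the binomial coefficient cancels throughout

# Inner spreadsheet: P(D|B) is the prior-weighted average over the bias values.
# The prior of 0.1 on each bias value really must add to 1 here, because this
# piece is NOT divided by a marginal afterward.
p_d_given_biased = sum(0.1 * binom_like(p) for p in bias_grid)
p_d_given_fair = binom_like(p_fair)

# Outer spreadsheet: SON = {Fair, Biased}, prior 1/2 each.
joint_fair, joint_biased = 0.5 * p_d_given_fair, 0.5 * p_d_given_biased
marginal = joint_fair + joint_biased
print("P(F|D) =", joint_fair / marginal, " P(B|D) =", joint_biased / marginal)
```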

I hope to finish the review sheet on Monday. Have a nice weekend!

HCOL 196, April 6, 2011

April 6, 2011

We went through the first three bullets on the review sheet.

The first one, on the suspected cheaters, asks you to advise the professor. You can tell the professor what the likelihood is, but the professor is going to have to decide on a prior. He won’t be able to do this based on reports of the particular students, because of privacy laws. So he’s going to have to use his experience, perhaps his estimate on how common cheating is from other classes he has taught.

For the likelihood, we expect the students to answer correctly, so the 17 correct answers don’t give us any information. But the eight wrong answers do, since they were answered identically. If the distractors and the correct answers are equally plausible, the probability of one coincidence is 0.2, and that of eight coincidences is 0.2^8. However, it is very difficult to write good distractors, and it is likely that the real probability of one coincidence is somewhat higher, perhaps one chance in three. Whatever value you decide on, call it p; the likelihood for no cheating is then p^8. Of course, if one student copied from the other, the likelihood for that state of nature is 1.

With prior (from the professor) and likelihood in hand, you are then able to tell the professor what the posterior probability of cheating is, and then he can take it from there as to what to do about it.
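A sketch of how the pieces fit together; the professor’s prior here is invented (say 5% of pairs cheat), and p is taken to be 1/3 as suggested above.

```python
# Invented prior and the p = 1/3 coincidence probability discussed above.
prior_cheat = 0.05
p = 1 / 3

like_cheat = 1.0          # identical wrong answers are certain if one student copied the other
like_no_cheat = p ** 8    # eight independent coincidences

joint_cheat = prior_cheat * like_cheat
joint_no_cheat = (1 - prior_cheat) * like_no_cheat
posterior_cheat = joint_cheat / (joint_cheat + joint_no_cheat)
print(round(posterior_cheat, 4))
```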

The second bullet is a problem we’ve discussed several times, so I’ll be brief. The important thing to clarify is whether the sampling of the taxis is with or without replacement. In the case of the taxi problem we discussed earlier, we know that it was with replacement, since we saw one of the taxis twice; but in this case, you can argue perhaps that the sampling was without replacement, since you saw these taxis on your way to the hotel and it’s unlikely that you’d see the same taxi twice.

If the sampling is with replacement, then the factor in the likelihood for taxi #N is 0 for n<N and 1/n for n≥N. There is one factor for each taxi, and because you saw taxi #150, the combined likelihood is 0 for n<150 and 1/n^7 for n≥150. On the other hand, if the sampling is without replacement, the combined likelihood is 0 for n<150 and 1/(n(n-1)(n-2)…(n-6)) for n≥150. In practice, the answers you get will be fairly similar either way.

For a prior, the HINT says that it’s more likely that you got dropped into a small city with few taxis than into a large city with many taxis. So you want a prior that decreases with n. A very common prior to use in situations like this is a “power law” prior of the form 1/n^a, where a is a positive number, probably of order 1 or 2. Power laws describe all sorts of things, e.g., the number of stars of mass M, the number of people that have assets of $D dollars, etc. So this is a reasonable prior. You should not cut the prior off at 150 taxis, because that assumes you’ve looked at the data, and priors are supposed to be determined before looking at the data. Besides, the likelihood automatically takes care of that.
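A sketch of both likelihood choices with a power-law prior. The exponent a and the cutoff for the sum are choices you would defend:

```python
# Seven taxis seen, the highest numbered being #150; n = total number of taxis.
HIGHEST_SEEN, TAXIS_SEEN = 150, 7

def likelihood(n, with_replacement=True):
    if n < HIGHEST_SEEN:
        return 0.0
    if with_replacement:
        return 1.0 / n ** TAXIS_SEEN                       # one factor of 1/n per taxi
    prod = 1.0
    for i in range(TAXIS_SEEN):                            # 1/(n(n-1)...(n-6))
        prod *= n - i
    return 1.0 / prod

a = 1                                                      # power-law exponent (1 or 2, say)
n_values = range(1, 100_001)                               # a large enough range to sum over
joint = [n ** (-a) * likelihood(n) for n in n_values]
marginal = sum(joint)
posterior = [j / marginal for j in joint]
print(sum(n * p for n, p in zip(n_values, posterior)))     # posterior mean number of taxis
```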

The third bullet is the test of the drug. This is a double-blind experimental design: neither the doctor nor the patient knows whether the pills the patient gets are the placebo (or the old drug) or the new drug being tested. This is important because, even if a pill has no activity (it’s a sugar pill), there are psychological effects on the patient that cause even a placebo to have some effect, however small.

The data are 25 of 50 patients cured in the “control” (placebo) group and 30 of 50 patients in the experimental group. The likelihood is shown in the whiteboard shot (for the special case r=0.25, s=0.35).

That number goes into the appropriate slot in a 10×10 grid of 100 cells, which could be a spreadsheet, for example, as shown by the arrow in the next chart for the likelihood:

Replacing the 0.25 and 0.35 with all ten of the points in our grid for r and for s yields the numbers we should put in the other 99 slots. That gives us the likelihood grid.

We’d need a similar grid of 100 numbers for the prior (you can imagine it lying underneath the likelihood grid for example). Multiplying corresponding grid points for the prior and the likelihood, we get a number for the joint that goes into a corresponding grid point. Add all 100 numbers up to get the marginal. Divide the marginal into each joint to get the corresponding posterior probability.

We are interested in the probability that the new drug is better than the old drug. Just add up the probabilities in the grid points where s>r, which means all of the entries in the upper right hand quadrant of the diagram, as shown in red below:
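In code, the whole grid calculation is a short sketch like the one below, assuming the usual binomial likelihood for the cured/not-cured counts and a uniform prior grid (any defensible prior grid could be substituted):

```python
from math import comb

# Data: 25 of 50 cured in the control group (rate r), 30 of 50 in the experimental group (rate s).
grid = [0.05 + 0.1 * i for i in range(10)]      # ten grid points each for r and s (my choice of spacing)

def binom(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Prior grid: uniform here; replace with any 10x10 table of prior probabilities.
prior = {(r, s): 1.0 / 100 for r in grid for s in grid}

# Likelihood grid, joint, marginal, posterior: exactly the steps described above.
joint = {(r, s): prior[(r, s)] * binom(25, 50, r) * binom(30, 50, s) for r, s in prior}
marginal = sum(joint.values())
posterior = {key: value / marginal for key, value in joint.items()}

# Probability that the new drug is better: add up the boxes where s > r.
print(round(sum(p for (r, s), p in posterior.items() if s > r), 3))
```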

Finally, as we noted, the CEO of the drug company is going to have to decide whether to spend $100 million to bring the drug to market. That $100 million is a toll gate that he has to go through if he decides to go further. Some of the things he’ll have to consider are: How big is the market, so that there will be enough patients to pay back the initial cost and yield a profit for the stockholders? How promising is the drug, really, as indicated by our preliminary study with 100 patients? If it isn’t much better than the old drug, which is quite cheap, it’s going to be hard to convince people to use the new drug (although various kinds of campaigns, TV ads for the patients and drug salesmen for the doctors, might change that). The company will have exclusive rights to the drug for only 17 years; is that going to be long enough to pay back the initial investment and earn a substantial profit? Etc., etc.

We’ll take up the remaining bullets on Friday.