## HCOL 196, April 11, 2011

Today we did the remaining bullets on the study sheet.

The first bullet, where a controversial topic is being polled, is very simple. No Bayesian analysis was asked for, just a simple calculation (I may discuss a more formal calculation on Wednesday, but it won’t be on the quiz). Simply stated, we expect 50 of the 100 participants to flip a head and say “yes.” So 50 of the “yes” answers are irrelevant. Of the remaining 50 subjects, 7 said “yes” and 43 said “no.” Assuming that the subjects are sufficiently assured by the protocol that no one will be afraid to answer truthfully, this means that 7/50, or 14%, of the subjects used the illegal drug in the past month.
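The arithmetic can be sketched in a few lines (a minimal sketch; the function name is mine, not part of the problem statement):

```python
def randomized_response_estimate(n_participants, n_yes_total):
    # Half of the participants are expected to flip heads and answer
    # "yes" regardless of the truth, so subtract them out.
    forced_yes = n_participants / 2
    truthful_respondents = n_participants - forced_yes
    return (n_yes_total - forced_yes) / truthful_respondents

# 100 participants, 57 total "yes" answers (50 expected from heads + 7 truthful)
print(randomized_response_estimate(100, 57))  # 0.14
```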

The second bullet is to design on paper a simple “expert system” that would, for example, allow automatic spam email detection, or allow a doctor to consult with a computer to suggest diagnoses given symptoms. In the case of spam detection, the “data” are various tokens (words like “Viagra”, or “astrophysics”) that tend, or tend not, to appear in spam messages. The user of the email program informs the expert system whether a message is spam by clicking on an appropriate icon. The posterior probability is on the two states of nature, “spam” and “not spam.” In the case of the expert medical system, the data are the various symptoms being presented, and the information that the system has will probably be put in by medical experts (although there is no reason why the system cannot “learn” to become better by telling it the ultimate diagnosis when one is definitively arrived at). The posterior probability will be on the various diagnoses, which may run into the thousands.

Any Bayesian system needs a prior and a likelihood. The spam system, for example, can estimate the prior from the proportion of messages identified by the user as spam. The likelihood can be estimated by determining, again from the user’s spam/not-spam labels, the proportion of spam and non-spam messages that contain particular tokens. Some tokens will be associated with spam, and some will be associated with non-spam messages. So the likelihood will indeed be an approximation to P(token|spam) and P(token|nonspam) for a collection of tokens.
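These estimates amount to simple counting. Here is one way it might look (a sketch under my own assumptions: messages arrive as sets of tokens with a spam/not-spam label, and I add Laplace smoothing so unseen tokens don’t get probability zero):

```python
from collections import Counter

def estimate_from_labels(labeled_messages):
    """labeled_messages: list of (tokens, is_spam) pairs from user clicks.
    Returns the prior P(spam) and smoothed estimates of
    P(token | spam) and P(token | not spam)."""
    n_spam = sum(1 for _, is_spam in labeled_messages if is_spam)
    n_ham = len(labeled_messages) - n_spam
    prior_spam = n_spam / len(labeled_messages)

    spam_counts, ham_counts = Counter(), Counter()
    for tokens, is_spam in labeled_messages:
        (spam_counts if is_spam else ham_counts).update(set(tokens))

    # Laplace-smoothed proportions of messages containing each token
    p_tok_spam = {t: (c + 1) / (n_spam + 2) for t, c in spam_counts.items()}
    p_tok_ham = {t: (c + 1) / (n_ham + 2) for t, c in ham_counts.items()}
    return prior_spam, p_tok_spam, p_tok_ham
```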

If there is more than one token, then the likelihood for a given email is obtained by multiplying the likelihoods for the individual tokens. One student asked if that won’t make the posterior probability very low, since you are multiplying numbers less than 1 and the product will be even smaller. But that’s not a problem, because you are going to divide by the marginal, and it is guaranteed that the posterior probabilities for spam/nonspam will add to 1. I asked what assumption is being made here, and another student responded that if one just multiplies the individual likelihoods, one is assuming independence. That is quite correct. The calculation is not taking into account that some tokens may be dependent on others. Nonetheless, it is remarkable that these “naive Bayes classifiers” are actually quite robust, even though they ignore the dependence issue.
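The multiply-then-normalize step can be sketched as follows (the token probabilities here are hypothetical numbers for illustration, and treating unseen tokens as uninformative with probability 0.5 is my own simplification):

```python
def spam_posterior(tokens, prior_spam, p_tok_spam, p_tok_ham):
    # Naive Bayes: multiply the per-token likelihoods (independence
    # assumption), then divide by the marginal so the two posterior
    # probabilities sum to 1.
    like_spam, like_ham = prior_spam, 1.0 - prior_spam
    for t in tokens:
        like_spam *= p_tok_spam.get(t, 0.5)  # unseen tokens are uninformative
        like_ham *= p_tok_ham.get(t, 0.5)
    marginal = like_spam + like_ham
    return like_spam / marginal

# Hypothetical token probabilities for illustration
p_spam = {"viagra": 0.8, "astrophysics": 0.01}
p_ham = {"viagra": 0.01, "astrophysics": 0.3}
print(spam_posterior({"viagra"}, 0.4, p_spam, p_ham))
```

Note that even though both products are small numbers, the division by the marginal restores a sensible probability, exactly as discussed above.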

The spam program and the diagnosis program “learn” when the expert (the user of the email, or the physician) enters the final decision into the system. When the email user identifies a particular message as “spam” (or by default, “not spam”), the prior on “spam” can be updated, and also the likelihood for the tokens in the message can be updated. So, in time, the system will perform better.
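The learning loop is just incremental counting. A minimal sketch (the class and method names are mine; Laplace smoothing keeps the prior defined before any messages have been labeled):

```python
from collections import Counter

class SpamLearner:
    """Running counts, updated each time the user clicks spam / not-spam."""
    def __init__(self):
        self.n = {True: 0, False: 0}              # message counts per class
        self.tokens = {True: Counter(), False: Counter()}

    def record(self, message_tokens, is_spam):
        # Each click updates both the prior counts and the token counts,
        # so the system performs better over time.
        self.n[is_spam] += 1
        self.tokens[is_spam].update(set(message_tokens))

    def prior_spam(self):
        # Laplace smoothing: defined even with no messages recorded yet
        return (self.n[True] + 1) / (self.n[True] + self.n[False] + 2)
```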

The Wikipedia article on Bayesian spam detection is pretty good.

Similarly, the diagnosis program will “learn” about the frequency of various diagnoses when the physician inputs the final diagnosis into the system, and it will (just as in the spam program) be able to update the likelihood for various symptoms given various diagnoses from that information.

Everyone seemed comfortable with their ability to draw decision trees as described by the last bullet.