## HCOL 195 11/16/09

On Monday the first thing I did was to flesh out a bit the calculation of the probability of saying “yes” and “no” in the polling example. It’s a straightforward application of probability theory:

Details of calculation of probabilities for survey problem

We then talked about the problem of setting up an “expert system” that takes input (patients, doctor’s diagnoses; emails, receiver’s opinion that it is spam), and after “learning” through a large amount of examples, can then do the diagnoses or decide on whether a new email is spam or not. We did this by considering the spam problem. By having the program look for a large number of words in the email, together with the recipient’s opinion that the message is or is not spam, allows us to estimate the conditional probabilities that a particular word (e.g., Viagra, hello) is in the message, given that the message is or is not spam. We do this by simply tallying up the number of occurrences and dividing appropriately. We can also get estimates of the prior probability of spam/not spam simply from the proportion of spam messages to the total.

However, the formulas I put on the board weren’t correct. I should have written that the conditional probability of a word, given that it is spam, is given by the number of times a word appears in a spam message, divided by the total number N of spam messages (I’m not sure what I said). That’s just

P(word|spam)=N(word,spam)/N(spam)=P(word,spam)/P(spam).

Here I’m just using the fact that

P(word,spam)=N(word,spam)/N(messages)

and

P(spam)=N(spam)/N(messages)

so that the number of messages, N(messages) cancels out.

Getting statistics on spam tokens

We then used Bayes’ theorem to estimate the posterior probability of a message being spam, given that it contains words $w_1,w_2,...,w_n$ by approximating $P(w_1,w_2,...,w_n \mid s)$ by the product of the approximate probabilities that we computed in the data-gathering phase. This approximation pretends that $w_i$ and $w_j$ are independent. Though it is an approximation, it turns out to be astonishingly good in practical applications. The result is a so-called naive Bayes classifier.

Calculation: Is It Spam?