HCOL 195 11/16/09

On Monday the first thing I did was to flesh out the calculation of the probability of saying “yes” and “no” in the polling example a bit. It’s a straightforward application of probability theory:

Details of calculation of probabilities for survey problem
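The linked notes have the details; as a reminder of the shape of the calculation, it is just the law of total probability. The sketch below assumes, purely for illustration (this may not be the exact design from class), a randomized-response poll in which each respondent flips a fair coin and answers honestly on heads, but says “yes” automatically on tails:

```python
def p_yes_no(p_true_yes, p_honest=0.5):
    """Law of total probability for a hypothetical randomized-response poll.

    p_true_yes: true proportion of people whose honest answer is "yes"
    p_honest:   probability a respondent answers honestly (fair coin => 0.5);
                otherwise the protocol forces the answer "yes"

    P(yes) = P(yes|honest) P(honest) + P(yes|forced) P(forced)
    """
    p_yes = p_true_yes * p_honest + 1.0 * (1.0 - p_honest)
    return p_yes, 1.0 - p_yes

# Example: if 30% of people would honestly answer "yes",
# the poll records "yes" about 65% of the time.
print(p_yes_no(0.30))  # (0.65, 0.35)
```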

We then talked about the problem of setting up an “expert system” that takes input (patients and doctors’ diagnoses; emails and the recipient’s opinion that each one is spam) and, after “learning” from a large number of examples, can then make the diagnosis or decide whether a new email is spam or not. We did this by considering the spam problem. Having the program look for a large number of words in each email, together with the recipient’s opinion that the message is or is not spam, allows us to estimate the conditional probability that a particular word (e.g., “Viagra”, “hello”) appears in a message, given that the message is or is not spam. We do this by simply tallying up the number of occurrences and dividing appropriately. We can also estimate the prior probabilities of spam and not-spam simply from the proportion of spam messages to the total.

However, the formulas I put on the board weren’t correct. I should have written that the conditional probability of a word, given that the message is spam, is the number of spam messages in which the word appears, divided by the total number N(spam) of spam messages (I’m not sure what I said). That’s just

P(word|spam)=N(word,spam)/N(spam)=P(word,spam)/P(spam).

Here I’m just using the fact that

P(word,spam)=N(word,spam)/N(messages)

and

P(spam)=N(spam)/N(messages)

so that the total number of messages, N(messages), cancels out.

Getting statistics on spam tokens
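In code, this tallying step might look like the following minimal sketch. Everything here is illustrative: the (text, is_spam) training format is made up, and a real spam filter would do much more preprocessing. It assumes at least one message of each kind.

```python
from collections import Counter

def train(messages):
    """Tally word counts from (text, is_spam) pairs.

    Returns per-class word counts, per-class message counts, and the
    prior probability of spam, estimated exactly as described above:
    P(word|spam) = N(word, spam) / N(spam),  P(spam) = N(spam) / N(messages).
    """
    word_counts = {True: Counter(), False: Counter()}  # True = spam
    message_counts = {True: 0, False: 0}
    for text, is_spam in messages:
        message_counts[is_spam] += 1
        # Count each word at most once per message, so that
        # N(word, spam) is the number of spam messages containing the word.
        for word in set(text.lower().split()):
            word_counts[is_spam][word] += 1
    n_total = message_counts[True] + message_counts[False]
    p_spam = message_counts[True] / n_total
    return word_counts, message_counts, p_spam
```

The function returns exactly the tallies in the formulas above: word_counts[True][w] is N(w, spam), message_counts[True] is N(spam), and p_spam is the estimate of P(spam).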

We then used Bayes’ theorem to estimate the posterior probability that a message is spam, given that it contains the words w_1, w_2, ..., w_n, by approximating P(w_1, w_2, ..., w_n | spam) by the product of the individual word probabilities that we computed in the data-gathering phase. This approximation pretends that the words w_i and w_j appear independently of one another. Though it is only an approximation, it turns out to be astonishingly good in practical applications. The result is a so-called naive Bayes classifier.

Calculation: Is It Spam?
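Putting the pieces together, the “is it spam?” calculation is Bayes’ theorem with the independence approximation. Here is a sketch in the same made-up setting, using the hypothetical train() from the previous sketch. It adds add-one (Laplace) smoothing so a word unseen in one class doesn’t force the whole product to zero; that is a standard fix, not something from class:

```python
def p_spam_given_words(words, word_counts, message_counts, p_spam):
    """Posterior P(spam | w_1, ..., w_n) under the naive independence
    approximation: proportional to P(spam) * P(w_1|spam) * ... * P(w_n|spam),
    normalized against the corresponding product for not-spam."""
    post = {True: p_spam, False: 1.0 - p_spam}
    for w in set(words):
        for is_spam in (True, False):
            # Add-one smoothing: (N(word, class) + 1) / (N(class) + 2),
            # so unseen words give a small probability instead of zero.
            p_w = (word_counts[is_spam][w] + 1) / (message_counts[is_spam] + 2)
            post[is_spam] *= p_w
    total = post[True] + post[False]
    return post[True] / total

# Tiny made-up example, using train() from the sketch above:
training = [
    ("cheap viagra buy now", True),
    ("viagra discount act now", True),
    ("hello are we meeting tomorrow", False),
    ("hello lunch tomorrow", False),
]
wc, mc, prior = train(training)
print(p_spam_given_words("viagra hello now".split(), wc, mc, prior))  # 0.75
```

With two spam-flavored words (“viagra”, “now”) against one not-spam-flavored word (“hello”), the posterior comes out at 0.75 in favor of spam, which matches the intuition behind the classifier.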
