Naive Bayes with Laplace Smoothing

On aiclass.com, we just covered Naive Bayes classifiers, and it couldn’t have been more perfectly timed.  Prior to that lecture series, one of the projects that I am working on required that I build a classifier for a large body of data that was getting funneled into the system.  I spent quite a bit of time searching for the best way to do this, hoping that there would be a rubygem that could save me some effort, but much to my chagrin, nothing quite fit the bill, so I started in on building my own.

The basic idea behind a Naive Bayes classifier is that we have some set of documents that have already been categorized (into n categories) and want to use this information about our existing labeled documents to predict the category of new, not-yet-labeled documents.  It is a pretty direct application of Bayes’ rule and is probably best understood through an example.

Say you have 5 documents:

  • {subject: ‘Must read!’, text: ‘Get Viagra cheap!’, label: ‘spam’}
  • {subject: ‘Gotta see this’, text: ‘Viagra.  You can get it at cut rates’, label: ‘spam’}
  • {subject: ‘Call me tomorrow’, text: ‘We need to talk about scheduling.  Call me.’, label: ‘not spam’}
  • {subject: ‘That was hilarious’, text: ‘Just saw that link you sent me’, label: ‘not spam’}
  • {subject: ‘dinner at 7’, text: ‘I got us a reservation tomorrow at 7’, label: ‘not spam’}
We have 2 spam messages and 3 real messages.  Each of these messages has a subject and some text that we can use to train our classifier.
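To make the rest of the walkthrough concrete, here is one way the training set might be represented in Ruby (the structure and field names are just illustrative):

```ruby
# Labeled training documents: each has a subject, some text, and a known label.
TRAINING_DOCS = [
  { subject: 'Must read!',         text: 'Get Viagra cheap!',                          label: :spam },
  { subject: 'Gotta see this',     text: 'Viagra. You can get it at cut rates',        label: :spam },
  { subject: 'Call me tomorrow',   text: 'We need to talk about scheduling. Call me.', label: :not_spam },
  { subject: 'That was hilarious', text: 'Just saw that link you sent me',             label: :not_spam },
  { subject: 'dinner at 7',        text: 'I got us a reservation tomorrow at 7',       label: :not_spam }
]
```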
Given a new message:
  • {subject: ‘See it to believe it’, text: ‘Best rates you’ll see’, label: ?}
What is the probability that it is a spam message?  Using Bayes’ rule we can compute it in the following way:
P(spam|subject,text)=\frac{P(subject,text|spam)*P(spam)}{P(subject,text|spam)*P(spam)+P(subject,text|notspam)*P(notspam)}
All of these values can be computed by inspection of the previous documents:
P(spam)=\frac{2}{5}
P(notspam)=\frac{3}{5}
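In code, the priors fall out of a simple count over the labels (a small sketch against the TRAINING_DOCS array above):

```ruby
# Class priors: the fraction of training documents carrying each label.
counts = TRAINING_DOCS.group_by { |doc| doc[:label] }.transform_values(&:size)
total  = TRAINING_DOCS.size.to_f

p_spam     = counts[:spam] / total      # => 0.4, i.e. 2/5
p_not_spam = counts[:not_spam] / total  # => 0.6, i.e. 3/5
```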
Note that in the case of Naive Bayes you assume conditional independence of your variables (which is probably not true, given that the English language is structured).
P(subject,text|spam)=P(subject|spam)*P(text|spam)
P(subject|spam)=\prod_{word \in subject}P(word|spam)
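As a rough sketch of those two formulas (again using the illustrative TRAINING_DOCS from above, with deliberately naive whitespace tokenization), the unsmoothed per-word and per-subject likelihoods might look like this:

```ruby
# Unsmoothed likelihood of a single word: how often it appears among all the
# subject words of training documents carrying the given label.
def word_given_label(word, label)
  subject_words = TRAINING_DOCS
                    .select { |doc| doc[:label] == label }
                    .flat_map { |doc| doc[:subject].downcase.split }
  subject_words.count(word) / subject_words.size.to_f
end

# Naive Bayes: the likelihood of a whole subject is the product over its words.
def subject_given_label(subject, label)
  subject.downcase.split.reduce(1.0) { |prod, word| prod * word_given_label(word, label) }
end
```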
So for example, in the document we want to classify:
P('see' \in subject|spam) = \frac{1}{5}
You will note that the document we want to classify contains some words that do not appear in any of the existing classified documents (e.g. ‘believe’).  Those words get a conditional probability of 0, which makes the whole numerator 0 even though there is clearly a greater-than-zero chance of this item being spam.
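With the unsmoothed sketch above, the problem shows up immediately:

```ruby
word_given_label('see', :spam)                      # => 0.2, the 1/5 computed above
word_given_label('believe', :spam)                  # => 0.0, 'believe' never appears in a spam subject
subject_given_label('see it to believe it', :spam)  # => 0.0, a single zero wipes out the whole product
```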
The solution to this problem is known as Laplace smoothing.  In order to perform smoothing you pick some parameter k; in our case we set k=1.  This smoothing parameter is added to every word count as the probabilities are calculated, and a normalizing term is added to the denominator to keep each result a valid probability.
Thus, with a smoothing parameter of 1:
P('believe' \in subject|spam)=\frac{0 + 1}{5+5*1}
Where does the 5*1 come from in the denominator?  We have a smoothing factor of 1 and there are 5 distinct words known to appear in spam subjects, so we add the smoothing factor once for each of them to the denominator; that keeps the smoothed values a valid probability distribution (they still sum to 1).
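A smoothed version of the per-word likelihood, following the numbers in this example (k = 1, five distinct known subject words for spam), might look like this:

```ruby
K = 1  # Laplace smoothing parameter

# Smoothed per-word likelihood: add K to the word's count, and add K once for
# every distinct known word to the denominator so the distribution still sums to 1.
def smoothed_word_given_label(word, label)
  subject_words = TRAINING_DOCS
                    .select { |doc| doc[:label] == label }
                    .flat_map { |doc| doc[:subject].downcase.split }
  distinct_words = subject_words.uniq.size
  (subject_words.count(word) + K) / (subject_words.size + K * distinct_words).to_f
end

smoothed_word_given_label('believe', :spam)  # => 0.1, i.e. (0 + 1) / (5 + 5 * 1)
smoothed_word_given_label('see', :spam)      # => 0.2, i.e. (1 + 1) / (5 + 5 * 1)
```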
Like I said, the math here is pretty straightforward if you can buy the assumptions.  And even if you can’t, it seems to work pretty well in practice.
So, how did I end up using it in my app?  I built a pretty simple gem to do classification called Classyfier.  It was based loosely on Bayes Motel, but I cleaned up and reorganized some things (as well as added smoothing).  I anticipate that I will be adding more features to this package as my need for more sophisticated classification grows.  For more info on how to use the gem, see the example below or just check out the test file.
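As a rough, standalone illustration of what is going on under the hood (the class and method names here are illustrative, not the gem’s actual API), a minimal Naive Bayes classifier with Laplace smoothing over both subject and text might look like this:

```ruby
# A minimal, self-contained Naive Bayes classifier with Laplace smoothing.
# It tokenizes the subject and text of each document together; the vocabulary
# is every distinct word seen during training (a slightly more conventional
# choice than the per-field count used in the worked example above).
class NaiveBayes
  def initialize(k = 1)
    @k = k
    @label_counts = Hash.new(0)                                     # documents per label
    @word_counts  = Hash.new { |h, label| h[label] = Hash.new(0) }  # word counts per label
    @vocabulary   = {}                                              # distinct words seen in training
    @total_docs   = 0
  end

  def train(doc)
    label = doc[:label]
    @label_counts[label] += 1
    @total_docs += 1
    tokenize(doc).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Returns the label with the highest posterior for an unlabeled document.
  def classify(doc)
    words = tokenize(doc)
    @label_counts.keys.max_by { |label| log_posterior(label, words) }
  end

  private

  def tokenize(doc)
    "#{doc[:subject]} #{doc[:text]}".downcase.split
  end

  # Work in log space so the product of many small likelihoods does not underflow.
  def log_posterior(label, words)
    prior = @label_counts[label] / @total_docs.to_f
    total_words = @word_counts[label].values.sum
    denominator = (total_words + @k * @vocabulary.size).to_f
    words.reduce(Math.log(prior)) do |sum, word|
      sum + Math.log((@word_counts[label][word] + @k) / denominator)
    end
  end
end

classifier = NaiveBayes.new(1)
TRAINING_DOCS.each { |doc| classifier.train(doc) }
classifier.classify(subject: 'See it to believe it', text: "Best rates you'll see")
# => :spam on this toy data set
```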