12/27/2003 Archived Entry: "The Bayes engine in SpamAssassin"

I recently wrote this to my boss...seemed a shame to have this writeup hidden.


The Bayes engine works by learning 'tokens', which in this case are words and small sets of words. Its database currently holds hundreds of thousands of tokens.

When it learns a token, it gives it a score based on whether the token came from 'spam' or 'ham' (ham being anything that is not spam). Of course, some tokens turn up in both types of email, so the score is not fixed: it is adjusted over time according to where the token keeps appearing, and gradually settles on a value that reflects how strongly that token points to spam or to ham.
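The idea of a per-token score that shifts as more mail is learned can be sketched in a few lines of Python. This is only an illustration: the TokenStore class, its method names, and the simple ratio formula are my own assumptions, and SpamAssassin's real tokenizer and scoring are more elaborate than this.

```python
from collections import Counter


class TokenStore:
    """Hypothetical token database: counts how many spam and ham
    messages each token has been seen in."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        # Naive tokenizer (assumption): lowercase words split on whitespace,
        # each token counted once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def spam_probability(self, token):
        # Fraction of spam messages and of ham messages containing the token.
        s = self.spam_counts[token] / self.n_spam if self.n_spam else 0.0
        h = self.ham_counts[token] / self.n_ham if self.n_ham else 0.0
        if s + h == 0:
            return 0.5  # never seen: treat as neutral
        return s / (s + h)


store = TokenStore()
store.learn("cheap pills online", is_spam=True)
store.learn("cheap flights online", is_spam=True)
store.learn("meeting notes online", is_spam=False)

for word in ("pills", "online", "meeting"):
    print(word, store.spam_probability(word))
```

A token seen only in spam ends up near 1, one seen only in ham ends up near 0, and one seen in both (like "online" here) sits in the middle. Every further call to learn() moves those numbers, which is the "adjusted over time" part.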

That's why it's best to feed a Bayes engine volumes and volumes of both spam and ham...so far it has learned 1545 spam emails and 1395 ham emails, and it's doing very well based on that. Much of that was hand-learned...manually telling the Bayes engine that specific emails are spam or ham. Now that it's operational though, I've set it to autolearn based on the scores that SpamAssassin gives...an email scoring 0.5 or less is learned as ham, and one scoring 7.5 or more is learned as spam. Anything in between is left alone, so we don't auto-learn emails that are borderline.
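The autolearn rule boils down to two thresholds with a dead zone between them. A minimal sketch, assuming the intended behaviour is ham at 0.5 or less and spam at 7.5 or more; the function name and its return values are made up for illustration, and in SpamAssassin itself this is set through configuration rather than code like this.

```python
HAM_THRESHOLD = 0.5   # at or below this score: learn as ham
SPAM_THRESHOLD = 7.5  # at or above this score: learn as spam


def autolearn_decision(score, ham_threshold=HAM_THRESHOLD,
                       spam_threshold=SPAM_THRESHOLD):
    """Return 'ham', 'spam', or None (borderline, do not learn)."""
    if score <= ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return None  # borderline mail is never auto-learned


for score in (-1.2, 0.5, 3.0, 7.5, 15.4):
    print(score, autolearn_decision(score))
```

The gap between the two thresholds is deliberate: a misclassified borderline email that got auto-learned would reinforce its own mistake.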

The junk words that you are seeing in spam these days are attempts by spammers to poison the Bayes database. It's been pretty ineffective, since the spammers don't seem to understand that unusual words ("alumina cleft abacus sophism actinium" from a recent spam) and garbage ("eGG3j2 8Dk46up4 1d8 362I CPbIR1p3v") never show up in the usual content of emails arriving here, so learning them does nothing to the scores of the tokens that actually matter.
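To see why padding a message with junk doesn't help the spammer, here is a rough sketch of combining token scores into one message score. The table of token probabilities is invented, and the combination is a plain naive-Bayes product, which I'm using for illustration and not claiming is what SpamAssassin does internally. The key assumption is that tokens the database has no opinion on are simply skipped.

```python
import math

# Invented per-token spam probabilities (assumption, for illustration only).
TOKEN_SPAM_PROB = {
    "cheap": 0.95,
    "pills": 0.99,
    "meeting": 0.05,
    "invoice": 0.20,
}


def message_spam_probability(text, table=TOKEN_SPAM_PROB):
    """Combine known token probabilities; unknown tokens are ignored."""
    log_spam = 0.0
    log_ham = 0.0
    for token in sorted(set(text.lower().split())):
        p = table.get(token)
        if p is None:
            continue  # junk and never-seen words carry no evidence
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Turn the two log-likelihoods back into a probability of spam.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


plain = message_spam_probability("cheap pills")
padded = message_spam_probability(
    "cheap pills alumina cleft abacus sophism actinium")
print(plain, padded)
```

Both calls give the same result, because the padding words contribute nothing. And if the padded spam is later learned, those words only pick up spam-leaning scores of their own, since they never appear in ham.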
