2007-12-30 · in Ideas · 172 words

Several machines I run spend most of their CPU time doing SpamAssassin analysis. SpamAssassin applies a very large set of rules to incoming messages in order to decide whether they're spam. Each rule has a positive or negative score, and the sum of all the scores for the rules that fired is compared against a threshold.
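As a rough illustration of that scoring model, here is a minimal Python sketch. It is not SpamAssassin's code: the `Rule` type, the `matches` predicate, the rule names, and the threshold of 5.0 are all assumptions made up for the example.

```python
from typing import Callable, List, NamedTuple


class Rule(NamedTuple):
    """Hypothetical rule: a name, a score, and a test applied to the message."""
    name: str
    score: float
    matches: Callable[[str], bool]


def total_score(message: str, rules: List[Rule]) -> float:
    # Run every rule and sum the scores of the ones that fired.
    return sum(rule.score for rule in rules if rule.matches(message))


def is_spam(message: str, rules: List[Rule], threshold: float = 5.0) -> bool:
    # Illustrative threshold; a message at or above it counts as spam.
    return total_score(message, rules) >= threshold


# Made-up rules, purely for demonstration.
example_rules = [
    Rule("MENTIONS_PILLS", 3.5, lambda m: "pills" in m.lower()),
    Rule("ALL_CAPS", 2.0, lambda m: m.isupper()),
    Rule("KNOWN_SENDER", -4.0, lambda m: "from: alice" in m.lower()),
]
```

Note that this naive version evaluates every rule on every message, which is where the CPU time goes.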

When filtering mail, once a message's score has passed the spam threshold, there's no point in running rules that can only push the score higher. You can therefore limit yourself to the rules that may reduce the score. If the score drops below the threshold again, you go back to trying positive-score rules. To prove a message is spam, you only have to make sure you've run all the rules that might help to prove it isn't.
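The idea can be sketched in Python roughly as follows. This is an illustration under assumptions rather than anything SpamAssassin does: the `Rule` type, the `classify_lazy` name, and the 5.0 threshold are invented for the example, and rules are treated as independent tests with fixed scores.

```python
from typing import Callable, List, NamedTuple, Tuple


class Rule(NamedTuple):
    """Hypothetical rule: a name, a score, and a test applied to the message."""
    name: str
    score: float
    matches: Callable[[str], bool]


def classify_lazy(message: str, rules: List[Rule],
                  threshold: float = 5.0) -> Tuple[bool, int]:
    """Return (is_spam, number_of_rules_evaluated)."""
    positive = [r for r in rules if r.score > 0]
    negative = [r for r in rules if r.score < 0]  # zero-score rules never matter
    total = 0.0
    evaluated = 0
    pos_i = neg_i = 0

    while True:
        if total >= threshold:
            # Currently spam: only negative rules could change the verdict.
            if neg_i == len(negative):
                return True, evaluated
            rule = negative[neg_i]
            neg_i += 1
        else:
            # Currently not spam: only positive rules could change the verdict.
            if pos_i == len(positive):
                return False, evaluated
            rule = positive[pos_i]
            pos_i += 1

        evaluated += 1
        if rule.matches(message):
            total += rule.score
```

The verdict matches what a full evaluation would give: when the function stops, every rule it skipped could only have moved the score further in the direction it had already settled on. The reported score is no longer the true total, though, so this only suits setups that need the yes/no answer.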

If we use this approach, where the positive-score rules don't always get used, then it would also make sense to keep track of which rules are most likely to fire, and try those first, so that a verdict is reached after running as few rules as possible.
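One simple way to keep that kind of record, sketched in Python. The `FireStats` class and its method names are hypothetical, and ordering by raw firing rate is only one possible heuristic; it ignores how expensive each rule is to run and how large its score is.

```python
from collections import Counter
from typing import Iterable, List


class FireStats:
    """Hypothetical bookkeeping: how often each rule fired when it was tried."""

    def __init__(self) -> None:
        self.tried: Counter = Counter()
        self.fired: Counter = Counter()

    def record(self, rule_name: str, did_fire: bool) -> None:
        self.tried[rule_name] += 1
        if did_fire:
            self.fired[rule_name] += 1

    def rate(self, rule_name: str) -> float:
        # Rules that have never been tried get a rate of 0.0.
        tried = self.tried[rule_name]
        return self.fired[rule_name] / tried if tried else 0.0

    def order(self, rule_names: Iterable[str]) -> List[str]:
        # Most frequently firing rules first; ties keep their original order.
        return sorted(rule_names, key=self.rate, reverse=True)
```

Calling `record` after each rule is evaluated and `order` before each batch of messages would be enough to keep the most productive rules at the front.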