Re: Bogofilter is my emails

From: Linus Torvalds
Date: 2006-09-04 09:09:38
On Sun, 3 Sep 2006, Shawn Pearce wrote:
> I'm not quite sure how to fix either message to get them to the list.
> Neither email was a patch so I'm not going to try resending them
> but I'm certainly a little curious as to how my email writing style
> twice tripped bogofilter's spam switch.

I'm surprised and disgusted that vger started using bogofilter.

Last I saw, the bogofilter approach was totally bogus, using purely 
single-word frequencies (or, more strictly, a "does word X exist or not", 
where X has often gone through what a linguist would probably call a 
"lemmatizer", ie something that turns different forms of the same word 
into its canonical word, aka "lemma") for its "bayesian" filtering.
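
To make the "does word X exist or not" point concrete, here is a minimal 
sketch of a presence-only naive Bayes scorer. It is not bogofilter's actual 
code: the class name, the tokenizer and the add-one smoothing are all 
assumptions made for illustration, and no lemmatization is attempted.

```python
import math
import re


def tokens(text):
    # A set, not a list: only the presence of a word matters,
    # not how often it occurs or where it sits in the message.
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class PresenceFilter:
    """Hypothetical presence-only naive Bayes scorer (illustration only)."""

    def __init__(self):
        self.docs = {"spam": 0, "ham": 0}    # messages seen per class
        self.seen = {"spam": {}, "ham": {}}  # word -> messages containing it

    def train(self, label, text):
        self.docs[label] += 1
        for word in tokens(text):
            self.seen[label][word] = self.seen[label].get(word, 0) + 1

    def spam_probability(self, text):
        # Start from the (smoothed) class prior, then add one independent
        # log-odds term for every word present in the message.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for word in tokens(text):
            p_spam = (self.seen["spam"].get(word, 0) + 1) / (self.docs["spam"] + 2)
            p_ham = (self.seen["ham"].get(word, 0) + 1) / (self.docs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Because the message is reduced to a set of words, any reordering of the 
same words gets exactly the same score, which is the "no account of more 
complex structure" weakness in its simplest form.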

Maybe they've enhanced it enough since, but it certainly used to be 
fairly easy to fool, since it at least originally didn't take any 
account at all of any more complex structure. 

There are even some papers about how the bayesian thing does not work well 
(even when extended to do some phrases and with lemmatization) if the 
cut-off is hard.

I think the bogofilter is probably an acceptable input as _one_ of many 
rules for a real spam-filter (ie as one of many spamassassin rules), but 
not for what vger does.
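
As a rough sketch of the "one of many rules" idea: each rule contributes a 
weight, and only the sum is compared against a threshold, so a confident 
bayesian verdict alone is not enough to reject a message. The rule names, 
weights, message fields and threshold below are all hypothetical and are 
not taken from any real spamassassin configuration.

```python
def score_message(msg, rules):
    """Sum the weights of all rules that fire; return (total, rule names hit)."""
    hits = [(name, weight) for name, weight, test in rules if test(msg)]
    return sum(weight for _, weight in hits), [name for name, _ in hits]


# Hypothetical rules and weights, for illustration only.
RULES = [
    ("BAYES_HIGH", 3.0, lambda m: m["bayes_prob"] > 0.9),
    ("NO_REVERSE_DNS", 2.5, lambda m: not m["has_rdns"]),
    ("SUBJECT_ALL_CAPS", 1.5, lambda m: m["subject"].isupper()),
]
THRESHOLD = 5.0  # assumed cut-off on the combined score


def is_spam(msg, rules=RULES, threshold=THRESHOLD):
    total, _hits = score_message(msg, rules)
    return total >= threshold
```

With these made-up numbers, a legitimate mail that happens to trip the 
bayesian rule scores 3.0 and still gets through; it takes a second 
independent signal to push it over the threshold.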

Hard rules at mail acceptance are much better if they use some really hard 
datum. For example, checking that the sending site actually also receives 
email, and that it resolves back to itself. That's one thing that OSDL 
does, for example, and it means that you can only send me email if your 
machine is actually designated as an MX gateway. That cuts down on a _lot_ 
of spam.

(I'd love to speak of the details, but I wouldn't know. Kees Cook set it 
all up at osdl, and I can just say that it works beautifully.)
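
Since the details of the OSDL setup are not given here, the following is 
only a sketch of the general "resolves back to itself" idea, not of what 
OSDL actually runs. It does a reverse lookup on the connecting IP and then 
checks that the resulting name resolves forward to the same IP. The function 
name is made up, the lookups are IPv4-only, and the MX half of the check is 
left out because the Python standard library has no MX lookup.

```python
import socket


def resolves_back_to_itself(ip,
                            reverse=socket.gethostbyaddr,
                            forward=socket.gethostbyname_ex):
    """Return True if ip -> hostname -> addresses leads back to ip.

    The two resolver arguments default to the standard library calls and
    exist only so the check can be exercised without touching real DNS.
    """
    try:
        hostname, _aliases, _addrs = reverse(ip)          # PTR lookup
        _name, _aliases, addresses = forward(hostname)    # A lookup
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses:
        # no reverse record, or the name does not resolve at all.
        return False
    return ip in addresses
```

A check like this is a hard datum in the sense used above: it depends only 
on how the sending site is set up, not on the wording of the message.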


