Date: Wed, 20 Aug 2003 15:43:55 -0400
To: mia-list@marxists.org
Subject: New anti-spam strategy for the MIA
User-Agent: Mutt/1.3.28i

[ Warning, this is a long one but well worth the read.  Please take
  some time to understand what's going on.  If you're unclear, drop me
  a mail. ]

Okay, kids.

Given the non-stop clamor for something to deal with spam on the
server, I've installed some software which implements the technique
described in Paul Graham's "A Plan for Spam" at

    http://www.paulgraham.com/spam.html

which has received some coverage in the past on this list.

Here are the rules:

 . Incoming mail is broken down into individual words, and those words
   are compared with words that appear in spam and words that appear
   in mail that's definitely not spam.  The strongest indicators in
   each direction (the words most and least likely to show up in spam)
   are combined statistically to decide whether the message is spam
   or not.

 . If it's determined to be spam, it will be filed in a 'spam' folder
   (depending on your mail client's configuration, it may be a
   mail/spam folder).  You'll need to check that folder every now and
   then and skim through the headers to see which messages have been
   accidentally marked as spam.  This should be a fairly low
   percentage (i.e., < 1%).  BUT IT WILL HAPPEN!  So if you never do
   this, be aware that there will be some messages people sent you,
   and that you want to receive, that you're ignoring.

 . If a message is mistakenly filed as spam, delete it from that
   folder.  Likewise, if a spam message is NOT properly filed,
   manually file it into that folder.

 . In addition to requiring a collection of spam messages to work
   from, this technique requires some good messages for comparison.
   Save some (many) of your good messages to a folder named 'good' in
   the same directory as the spam folder.  So if your spam folder is
   'spam', save good mail to 'good'.  If your spam folder is
   'mail/spam', save good mail to 'mail/good'.
   Additionally, if you have completely cleaned your inbox of spam and
   are prepared to commit to keeping it so, let me know.  Right now
   David is the only person who has the good folder and the clean
   inbox (he's also the only one with a spam folder, but that will
   change shortly).  That means the examples of good mail ALL COME
   FROM DAVID'S EMAIL.  If you want to be sure that this tool works
   properly with your mail, you need to set up the good folder, or at
   the very least clean up your inbox and let me know, and I'll add
   you to the list of people whose mail can be used to train the spam
   filter.  It would be best if you could save a few hundred or a few
   thousand good emails to the good folder and keep them relatively
   current (to reflect the changing trends in your email).

   Note that this will NOT expose your email to other people.

Caveats:

 . I don't really know how large the dictionary will grow.  Right now
   it's a little less than 1MB.  Mine at home (which works from
   several thousand messages) is just over 1MB.  It's quite likely
   that it won't grow much larger, since it's really just a collection
   of words plus a statistic for each, so it only grows when it
   encounters previously unknown words.  The size does affect the time
   it takes to process an incoming email, though, because the mail
   filter has to load the dictionary before classifying an email as
   spam or not.

 . This is an adaptive mechanism that works by looking at the
   composition of good email versus spam.  As such, the more specific
   the population it targets, the more accurate it can be.  That is to
   say, it works best with one person's email.  More people will tend
   to make the signal it's working with more diffuse.  It's not
   certain how much throwing everyone into the same pool will hurt
   performance.  In the simplest case, there are certainly emails that
   I might consider spam that another person in the MIA wouldn't.
   That's a case that can never be handled successfully so long as
   we're pooled together.
Hopefully the benefits of aggregation will be worth the occasional
fuzziness.

Questions?  Let me know.