Date: Wed, 20 Aug 2003 15:43:55 -0400
To: mia-list@marxists.org
Subject: New anti-spam strategy for the MIA
User-Agent: Mutt/1.3.28i

[ Warning, this is a long one but well worth the read.  Please take some
time and understand what's going on.  If you're unclear, drop me a mail. ]

Okay, kids.  Given the non-stop clamor for something to deal with spam
on the server, I've installed some software which implements the
technique described in Paul Graham's "A Plan for Spam" at

  http://www.paulgraham.com/spam.html

which has received some coverage in the past on this list.  Here are the
rules:

  . Incoming mail is broken down into individual words, which are
    compared with words that appear in spam and words that appear in
    mail that's definitely not spam.  The strongest indicators in
    either direction are then combined statistically to decide whether
    the message is spam or not.

  . If it's determined to be spam, it will be filed in a 'spam' folder
    (depending on your mail client's configuration, it may be a
    mail/spam folder).  You'll need to check that folder every now and
    then to skim through the headers to see what messages have been
    accidentally marked as spam.  This should be a fairly low percentage
    (i.e., < 1%).  BUT IT WILL HAPPEN!  So if you never do this, be
    aware that you'll be ignoring some messages that people sent you
    and that you want to receive.

  . If a message is mistakenly filed as spam, delete it from that folder.
    Likewise, if a spam message is NOT filed as spam, manually file it
    into that folder.

  . In addition to requiring a collection of spam messages to work from,
    this technique requires some good messages for comparison.  Save
    some (many) of your good messages to a folder named 'good' in the
    same directory as the spam file.  So if your spam folder is 'spam'
    save good mail to 'good'.  If your spam folder is 'mail/spam' save
    good mail to 'mail/good'.  Additionally, if you have completely
    cleaned your inbox of spam and are prepared to commit to keeping it
    so, let me know.

    Right now David is the only person who has the good folder and the
    clean inbox (he's also the only one with a spam folder, but that
    will change shortly).  That means the examples of good mail ALL COME
    FROM DAVID'S EMAIL.  If you want to be sure that this tool works
    properly with your mail, you need to establish the good mail file or
    at the very least clean up your inbox and let me know and I'll add
    you into a list of people whose mail can be used to propagate the
    spam filter.  It would be best if you could save a few hundred or
    thousand good emails to the good file and keep them relatively
    current (to reflect the changing trends in your email).

    Note that this will NOT expose your email to other people.
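
For the curious, the first rule above can be sketched roughly as
follows.  This is a minimal Python sketch of the scoring described in
Graham's essay, NOT the code of the software installed on the server.
The function names are made up for illustration, and the constants (0.4
for rarely seen words, the 15 strongest indicators, a 0.9 cutoff) are
the ones from the essay, which the installed software may or may not
share.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercased runs of letters, digits, dollar signs, apostrophes, dashes.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(messages):
    # Count how often each word occurs across a list of message strings.
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def word_probability(word, good, bad, ngood, nbad):
    # ngood / nbad are the numbers of good and spam messages trained on.
    g = 2 * good[word]  # good counts are doubled to bias against false positives
    b = bad[word]
    if g + b < 5:
        return 0.4  # rarely seen or unknown word: assumed mildly innocent
    good_freq = min(1.0, g / max(ngood, 1))
    bad_freq = min(1.0, b / max(nbad, 1))
    return max(0.01, min(0.99, bad_freq / (good_freq + bad_freq)))


def spam_probability(message, good, bad, ngood, nbad, top=15):
    # Score each distinct word, keep the ones furthest from a neutral 0.5,
    # and combine them into one probability.
    probs = [word_probability(w, good, bad, ngood, nbad)
             for w in set(tokenize(message))]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = 1.0
    prod_good = 1.0
    for p in probs[:top]:
        prod_spam *= p
        prod_good *= 1.0 - p
    return prod_spam / (prod_spam + prod_good)
```

In this sketch, 'good' and 'bad' would come from train() run over the
contents of the good and spam folders, and a message would be filed as
spam when spam_probability() comes out above 0.9.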

Caveats:

  . I don't really know how large the dictionary will grow.  Right now
    it's a little less than 1MB. Mine at home (which works from several
    thousand messages) is just over 1MB.  It probably won't grow much
    larger: it's really just a collection of words plus a statistic for
    each, so it only grows when it encounters previously unknown words.
    The size does affect the time it takes to process an incoming email,
    though, since the mail filter has to load the dictionary before
    classifying an email as spam or not.

  . This is an adaptive mechanism that works by looking at the
    composition of good email versus spam.  As such, the more specific
    the population it targets, the more accurate it can be.  That is to
    say, it works best with one person's email.  More people will tend
    to make the signal it's working with more diffuse.  It's not
    certain how much throwing everyone into the same pool will impact
    performance.  In the simplest case, there are emails that I might
    consider spam that another person in the MIA wouldn't; that's a case
    that can never be handled successfully so long as we're pooled
    together.  Hopefully the benefits of aggregation will be worth the
    occasional fuzziness.
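
To make the two caveats above a bit more concrete, here is a small
Python sketch of a shared dictionary.  It assumes the dictionary is
simply a map from each word to a pair of counts (good, spam); the real
on-disk format of the installed software may well differ, and
add_message() is a made-up name.

```python
import re


def add_message(dictionary, text, is_spam):
    """Update word counts in place; return how many new words were added."""
    before = len(dictionary)
    for word in set(re.findall(r"[a-z0-9$'-]+", text.lower())):
        counts = dictionary.setdefault(word, [0, 0])  # [good count, spam count]
        counts[1 if is_spam else 0] += 1
    return len(dictionary) - before


pooled = {}
print(add_message(pooled, "buy cheap pills now", True))   # new words added
print(add_message(pooled, "buy cheap pills now", True))   # same words again
print(add_message(pooled, "buy the archive", False))      # one person's good mail
print(pooled["buy"])                                       # [good, spam] counts
```

Feeding in the same words a second time only bumps the counts, so the
dictionary does not grow.  And once one person's good mail and another
person's spam both contain a word, its counts pull in both directions,
which is the fuzziness that pooling introduces.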

Questions? Let me know.