SpamAssassin with IMAP


This whole project started when I wanted to send a copy of all of my e-mail to my cell phone as a text message (Verizon phones all have e-mail addresses at @vtext.com). The problem is that each message costs 2 cents to receive, and I didn't want to pay for spamming my own cell phone. My solution was to set up SpamAssassin on my mail server and filter the spam out first using Bayesian filtering. I didn't want to go through the pain of forwarding every false positive and false negative back through SpamAssassin to help it learn, and I didn't want the Subject line modified with the standard SPAM tag. Since I was using an IMAP server, I figured there must be an easier way: let SpamAssassin learn directly from the contents of my mail folders. As it turns out, SpamAssassin is relatively easy to configure to do this.

This tutorial explains how to set up SpamAssassin in single-user mode with learning based on IMAP folders.

1. Install SpamAssassin as usual if it isn't installed already.

2. Save the script below as sa-learn.sh, run it once on your existing e-mail, and then add it to your crontab. Each time it runs, SpamAssassin learns from the messages currently in those folders. The first run will probably take a long time.

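#!/bin/sh
# sa-learn.sh
# Use the --mbox flag only if your folders are in mbox format.
# Newer SpamAssassin releases (3.x) call these options --no-sync and --sync
# instead of --no-rebuild and --rebuild.
# Learn ham from the Inbox and spam from the Spam folder, then rebuild the
# Bayes database once at the end.
sa-learn --no-rebuild --mbox --ham ~/mail/Inbox
sa-learn --no-rebuild --mbox --spam ~/mail/Spam
sa-learn --rebuild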
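
Adding the script to your crontab could look roughly like the sketch below. The schedule (once an hour, at 15 minutes past) and the script location ($HOME/bin/sa-learn.sh) are assumptions for illustration, as are the function names; adjust them to your own setup.

```shell
#!/bin/sh
# Print the crontab line that runs sa-learn.sh once an hour.
# The schedule and the script path are illustrative assumptions.
cron_entry() {
  printf '%s\n' '15 * * * * $HOME/bin/sa-learn.sh >/dev/null 2>&1'
}

# Install the entry, dropping any earlier sa-learn.sh line so that
# running this twice does not create a duplicate.
install_cron_entry() {
  (crontab -l 2>/dev/null | grep -F -v 'sa-learn.sh'; cron_entry) | crontab -
}
```

Call install_cron_entry once, then confirm the result with 'crontab -l'.
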
Check your results by typing 'sa-learn --dump magic' at a prompt. If it doesn't report at least 200 ham and 200 spam messages, lower the minimums in your user_prefs file (step 4) so that the Bayesian filter starts working sooner.
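
If you would rather not read the dump by eye, a small helper can pull out the two counters. This is only a sketch: it assumes the counts appear in the third column of the lines ending in "non-token data: nspam" and "non-token data: nham", which is how the dump usually looks but may differ between versions. The function name bayes_ready is made up for this example.

```shell
#!/bin/sh
# Read `sa-learn --dump magic` output on stdin, print the learned message
# counts, and return success only if both reach the minimum (default 200).
# Assumes the count is the third whitespace-separated column.
bayes_ready() {
  min=${1:-200}
  awk -v min="$min" '
    /non-token data: nspam/ { nspam = $3 }
    /non-token data: nham/  { nham  = $3 }
    END {
      printf "nspam=%d nham=%d\n", nspam, nham
      exit !(nspam >= min && nham >= min)
    }'
}
```

Usage: sa-learn --dump magic | bayes_ready || echo "not enough training data yet"
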

3. Edit your .procmailrc file. This is what passes each e-mail message to SpamAssassin and moves tagged spam into the Spam folder. It should look something like this:
#~/.procmailrc
# SpamAssassin sample procmailrc
#
# Pipe the mail through spamassassin (replace 'spamassassin' with 'spamc'
# if you use the spamc/spamd combination)
#
# The condition line ensures that only messages smaller than 250 kB
# (250 * 1024 = 256000 bytes) are processed by SpamAssassin. Most spam
# isn't bigger than a few k and working with big messages can bring
# SpamAssassin to its knees.
#
# The lock file ensures that only 1 spamassassin invocation happens
# at 1 time, to keep the load down.
#
:0fw: spamassassin.lock
* < 256000
| spamassassin

# All mail tagged as spam (i.e. with a score higher than the set threshold)
# is moved to "Spam".
:0:
* ^X-Spam-Status: Yes
mail/Spam	#Your Spam folder name here

# vtext.com compatible forwarding code
# get "From" address and store
:0 h
FROM=|formail -IReply-To: -rtzxTo:
# get "Subject" and store (used by the Cingular/AT&T line below)
:0 h
SUBJECT=|formail -zxSubject:
:0c:
# This filters out most system messages
* !^From: .*\@localhost.*
* !^From: .*(postmaster|MAILER-DAEMON)\@.*
# Must specify some addresses to sendmail so it decodes properly for SMS text messages.
# -r is where any error messages should go.
# Verizon strips out anything past the first 160 characters.
#| /usr/sbin/sendmail -r myrealemailaddress@mydomain.com -f $FROM number@vtext.com
# Cingular/AT&T sends as many SMS messages as it takes to deliver the entire e-mail,
# so this limits it to 1 message (160 characters) per e-mail.
| /usr/bin/mailtextbody | formail -I "Subject: $SUBJECT" -I "From: $FROM" | head -c 162 | /usr/sbin/sendmail -r myrealemailaddress@mydomain.com -f $FROM number@cingularme.com
# Cingular text format    Normal mail format    Character difference
# FRM:<from>              From: <from>          +2
# SUBJ:<subject>          Subject: <subject>    +4
# MSG:<body>              <body>                -4
#                                               +2 total
# (so head -c 162 above leaves 160 characters once Cingular reformats the message)

# Normal code for forwarding a copy of ham messages; not fully compatible with vtext.com
#:0c:
# ! forwardaddress@domain.com

# Work around procmail bug: any output on stderr will cause the "F" in "From"
# to be dropped.  This will re-add it.
:0
* ^^rom[ ]
{
  LOG="*** Dropped F off From_ header! Fixing up. "

  :0 fhw
  | sed -e '1s/^/F/'
}
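
The 162 passed to head in the Cingular/AT&T recipe comes from the character-difference table in the comments. The sketch below redoes that arithmetic from the prefix strings themselves, so you can adapt it if your carrier uses a different layout. The function names and the 160 default are illustrative, and it assumes your head supports -c, as the recipe already does.

```shell
#!/bin/sh
# Work out how many bytes of a normally formatted message fit in one SMS
# once the carrier swaps in its own, shorter or longer, prefixes.
sms_byte_budget() {
  limit=${1:-160}                 # bytes delivered in a single SMS
  normal_from='From: ';    sms_from='FRM:'
  normal_subj='Subject: '; sms_subj='SUBJ:'
  sms_body='MSG:'                 # normal mail has no body prefix
  extra=$(( ${#normal_from} - ${#sms_from} ))
  extra=$(( $extra + ${#normal_subj} - ${#sms_subj} ))
  extra=$(( $extra - ${#sms_body} ))
  echo $(( $limit + $extra ))
}

# Cut stdin down to that budget, as the head -c step in the recipe does.
truncate_for_sms() {
  head -c "$(sms_byte_budget "$@")"
}
```

With the defaults, sms_byte_budget prints 162, matching the value used in the recipe.
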
4. Edit your user_prefs file. This is where you customize the SpamAssassin settings.
#~/.spamassassin/user_prefs
# SpamAssassin user preferences file.  See 'perldoc Mail::SpamAssassin::Conf'
# for details of what can be tweaked.
#########################################################

# How many hits before a mail is considered spam.
#required_hits           5

report_safe 0

# You may need to set these lower than the default 200 early on to get
# SpamAssassin to start filtering, depending on your initial training from step 2.
#bayes_min_ham_num 10
#bayes_min_spam_num 10

# Whitelist and blacklist addresses are now file-glob-style patterns, so
# "friend@somewhere.com", "*@isp.com", or "*.domain.net" will all work.
# whitelist_from        someone@somewhere.com

# Add your own customised scores for some tests below.  The default scores are
# read from the installed spamassassin rules files, but you can override them
# here.  To see the list of tests and their default scores, go to
# http://spamassassin.org/tests.html .
#
# score SYMBOLIC_TEST_NAME n.nn

# I raised these scores to more effectively filter out spam
# Add more as you see fit
score BAYES_50 5.1
score BAYES_56 5.1
score BAYES_60 5.1
score BAYES_70 5.1
score BAYES_80 5.1
score BAYES_90 5.1
# This will be sure to filter all ADV messages
score ADVERT_CODE 5.1
score ADVERT_CODE2 5.1
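
After editing user_prefs it is worth checking that SpamAssassin can still parse its configuration. The sketch below wraps 'spamassassin --lint' in a small helper; the function name is made up for this example, and it assumes the check is run as the same user whose preferences you just edited.

```shell
#!/bin/sh
# Run SpamAssassin's built-in configuration check and report the result.
# Returns 127 if spamassassin is not installed, 1 if the check fails.
lint_user_prefs() {
  if ! command -v spamassassin >/dev/null 2>&1; then
    echo "spamassassin not found in PATH" >&2
    return 127
  fi
  if spamassassin --lint; then
    echo "configuration parses cleanly"
  else
    echo "configuration has errors; rerun with: spamassassin --lint -D" >&2
    return 1
  fi
}
```

Run lint_user_prefs after each change; a typo in a score line is much easier to find here than by watching mail go unfiltered.
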
5. Now all you need to do is periodically check your Spam folder for false positives and move them to the Inbox, and move any spam that reached the Inbox to the Spam folder. The next time the sa-learn.sh script runs, it will re-learn those messages into the proper group. If you continue to have problems, consult the SpamAssassin documentation.