[NBLUG/talk] sa-learn system wide...

Tue Jul 15 17:03:00 PDT 2003

(yes, I'm replying to two different messages in one message; they were
related and the In-Reply-To header is correct)

On Tue, Jul 15, 2003 at 12:34:58PM -0700, ME wrote:
> A better question would be this:
> 
> Is it possible to enable bayes filtering and have sa-learn work when SA
> settings are stored in a sql db?

Yes.  The Bayes stuff is, of course, not in the SQL DB.

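Something like: "spamd -d -x -H /var/local/spamassassin -q -a" (-q is what
pulls per-user settings from SQL; -x turns off per-user config files), along
with setting auto_whitelist_path, bayes_path, auto_whitelist_file_mode,
bayes_file_mode and use_bayes in the site-wide config.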
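
For the settings part, a minimal sketch of what the site-wide config might
contain.  The option names are the ones listed above; the paths and modes are
only my assumptions (they reuse the -H directory from the spamd line), so
adjust them to your own layout:

```
# site-wide local.cf sketch; paths and modes below are assumptions
use_bayes                 1
# bayes_path is a base name; SA creates its database files from it
bayes_path                /var/local/spamassassin/bayes
# x bits included in case the same mode gets used for directories
bayes_file_mode           0770
auto_whitelist_path       /var/local/spamassassin/auto-whitelist
auto_whitelist_file_mode  0660
```

Whatever user spamd runs as needs write access to that directory.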

> If it is possible to do this, then how can you guarantee locking of a
> shared db (the bayes db) across multiple servers?

Mail::SpamAssassin::Bayes delegates this task to
Mail::SpamAssassin::BayesStore which uses the interface in
Mail::SpamAssassin::Locker on either a Mail::SpamAssassin::Win32Locker or
Mail::SpamAssassin::UnixLocker object.

In other words: somebody already thought of that and it's handled.

In other other words:
I was going to quote the safe_lock routine from Mail::SpamAssassin::UnixLocker,
but suffice it to say that it's NFS-safe, retries automatically and should
be capable of detecting (and breaking) stale locks on its own.  Upon
further examination, one of the people who thought of this before you did
and made sure it was handled was Kelsey Cummings from our own Sonic.net.  The
one catch: if there's any chance of a single sa-learn run holding the lock
for more than 10 minutes, you could run into a problem, presumably because a
lock that old looks stale and gets broken.
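
To give a feel for what "NFS-safe, retries, breaks stale locks" means, here
is a rough Python sketch of the general hard-link lock-file technique.  It is
NOT the actual safe_lock code from SpamAssassin (that is Perl); the function
names, retry count and delay are my own assumptions, and the 600-second
threshold is just the "10 minutes" mentioned above:

```python
import os
import socket
import time

LOCK_MAX_AGE = 600   # seconds; assumed stale threshold (the "10 minutes")
RETRIES = 30         # assumed number of attempts
RETRY_DELAY = 1.0    # assumed pause between attempts, in seconds


def safe_lock(path, max_age=LOCK_MAX_AGE, retries=RETRIES, delay=RETRY_DELAY):
    """Try to take path + '.lock'; return True on success, False otherwise."""
    lock_file = path + ".lock"
    # Temp name is unique per host and process, so two servers never collide.
    tmp_file = "%s.%s.%d" % (lock_file, socket.gethostname(), os.getpid())
    with open(tmp_file, "w") as fh:
        fh.write("%s %d\n" % (socket.gethostname(), os.getpid()))
    try:
        for _ in range(retries):
            try:
                # link() either creates the lock file or fails because it
                # already exists; that atomicity is what makes this NFS-safe.
                os.link(tmp_file, lock_file)
                return True
            except FileExistsError:
                pass
            try:
                age = time.time() - os.stat(lock_file).st_mtime
            except FileNotFoundError:
                continue  # holder just released it; try again right away
            if age > max_age:
                # Holder is assumed dead: break the stale lock and retry.
                # (A real implementation has to be more careful about races.)
                try:
                    os.unlink(lock_file)
                except FileNotFoundError:
                    pass
                continue
            time.sleep(delay)
        return False
    finally:
        os.unlink(tmp_file)


def safe_unlock(path):
    """Release the lock taken by safe_lock."""
    os.unlink(path + ".lock")
```

Note how a job that legitimately holds the lock longer than max_age gets its
lock broken out from under it, which is exactly the 10-minute problem.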

(Also, make sure you have the BerkeleyDB libraries and DB_File -- I believe
those default to a semi-reasonable behavior even if you don't have a locking
mechanism like SA uses)

On Tue, Jul 15, 2003 at 01:39:20PM -0700, Dustin Mollo wrote:
> yes, they can, but they don't at the moment.  I'm more concerned about a
> shared/replicated database.  ok...maybe not concerned about how to get it
> between servers, but more concerned about unruly users (if you know where i
> work, bet you can guess who i mean by that) poisoning the database with false
> emails one way or the other and thereby making the database less useful to
> the user population as a whole.  the last thing i need is for the
> president's email to his vp's to end up marked as spam.

It'd be tricky, but it's possible to have a separate Bayes DB for each
user...  I'd estimate that that would multiply your synchronization efforts
by 2-10 times.  (the computer would have to work even harder)

Or in the global config, "whitelist_from president" (vp, etc -- or entire
domain) -- possibly better as "whitelist_from_rcvd president
presidents_hostname".  "whitelist_to big_list" might work instead of or in
addition to that.
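
Spelled out as config lines, that might look something like the following.
The addresses and hostname are made-up placeholders, not anybody's real
setup:

```
# global config sketch; every address and hostname here is hypothetical
whitelist_from       president@example.edu
# safer: only honor the whitelist when the mail arrived via this relay
whitelist_from_rcvd  president@example.edu  mail.example.edu
whitelist_to         big-list@example.edu
```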

Or there could be some sort of thing where the "confirmed spam" and
"confirmed non-spam" IMAP folders are merely *submission* folders which must
be reviewed by a second human (presumably from a smaller set of trusted
people) who has access to the "really truly spam" IMAP folder....

If you save the emails, as well as piping them to sa-learn, you can always
rerun sa-learn with the correct classification later if need be....

Or do all of the above:

First, set up appropriate global whitelisting.  Users can have the option to
"unwhitelist" or "blacklist" (or both) those addresses...  Then set up a
shared-folders mechanism that normal users can drop a message into but not
move back out.  Have sa-learn learn from the newly added messages (via
fetchmail?) almost immediately -- possibly using "spamassassin --report"
instead of "sa-learn --spam".  Have a nightly (or weekly or monthly; depends
on performance needs) process relearn from the two mail folders.  Then have
a group of people who can reclassify anything that's been miscategorized
(you'll need a third "not quite either" category).  If you use "spamassassin
--report" on the spam messages, it's probably best to use "spamassassin
--revoke" on the recategorized now-ham messages.  Run "sa-learn --forget"
over the uncategorizable messages.
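
The reviewer step above could be glued together with something like this
Python sketch.  It only decides which command to run for one reviewed
message; the verdict labels ("spam", "ham", "unsure"), the function name and
the "reported" flag are all hypothetical, and I'm assuming spamassassin gets
the message on standard input while sa-learn gets a file name:

```python
def command_for(verdict, msg_path, reported=False):
    """Pick the command for one reviewed message.

    Returns (argv, stdin_path): argv is the command to run, and stdin_path
    is the file to feed on standard input, or None if the path is in argv.
    'reported' says whether the message earlier went through
    "spamassassin --report" rather than "sa-learn --spam".
    """
    if verdict == "spam":
        return (["sa-learn", "--spam", msg_path], None)
    if verdict == "ham":
        if reported:
            # Undo the earlier report instead of just relearning.
            return (["spamassassin", "--revoke"], msg_path)
        return (["sa-learn", "--ham", msg_path], None)
    if verdict == "unsure":
        # The "not quite either" category: drop whatever was learned.
        return (["sa-learn", "--forget", msg_path], None)
    raise ValueError("unknown verdict: %r" % (verdict,))
```

A nightly job would loop over the review folders and hand each result to
subprocess, opening stdin_path for the commands that want it.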

Remember:
"sa-learn --ham <msg> ; sa-learn --spam <msg>" == "sa-learn --spam <msg>", 
"sa-learn --spam <msg> ; sa-learn --ham <msg>" == "sa-learn --ham <msg>", 
"sa-learn --ham <msg> ; sa-learn --ham <msg>" == "sa-learn --ham <msg>", 
"sa-learn --spam <msg> ; sa-learn --spam <msg>" == "sa-learn --spam <msg>", 
"sa-learn --spam <msg> ; sa-learn --forget <msg>" == "",
"sa-learn --ham <msg> ; sa-learn --forget <msg>" == "", and
"sa-learn --ham <msg>" == "sa-learn --forget <msg> ; sa-learn --ham <msg>".

(in other words, whenever you sa-learn, it undoes any previous learning done
on that specific message and then learns it as what you told it to do. 
Except it will just silently ignore you if you try to relearn a message the
same way as it's currently recorded)
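
If it helps, here is a toy Python model of those rules.  It is not
SpamAssassin code, just the bookkeeping as I described it above; the class
and method names are made up for illustration:

```python
class LearnModel:
    """Toy model of the relearning rules listed above."""

    def __init__(self):
        self.seen = {}  # message id -> "spam" or "ham"

    def learn(self, msg_id, kind):
        """Record msg_id as kind; return the actions that would happen."""
        previous = self.seen.get(msg_id)
        if previous == kind:
            return []  # already recorded this way: silently ignored
        actions = []
        if previous is not None:
            actions.append("forget " + previous)  # undo the earlier learning
        actions.append("learn " + kind)
        self.seen[msg_id] = kind
        return actions

    def forget(self, msg_id):
        """Drop whatever was learned for msg_id, if anything."""
        previous = self.seen.pop(msg_id, None)
        return [] if previous is None else ["forget " + previous]
```

So learning a message as ham and then as spam leaves it recorded as spam
only, and forgetting it afterwards leaves nothing behind.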

Hmmm....

Seems to me that the best (or at least most intriguing) solution (in terms
of shared spam filtering) would be a razor-style shared DB with a
credibility rating/reputation score for each user and a shared DB of rated
tokens instead of a shared DB of rated checksums.  Unfortunately that would
suck up hundreds to thousands of times more resources (server, client and
network) than something like razor.

> i haven't been sold on the idea this is something we want to do, but i'm
> trying to explore the technical issues first to make sure there aren't any
> major show-stoppers before trying to tackle the logistical and political
> issues.  sounds like it wouldn't be that hard, technically.

Organization-wide Bayes filtering isn't nearly as useful as
individually-tuned Bayes filtering.  And there's always that immature
wannabe-proto-hacker who'll want to miscategorize things just because they
can.  Or the confused user who gets it backwards every time.

> > Does "multiple platforms" mean some spamassassin servers are little-endian
> > and some are big-endian?
> 
> sorry - "multiple platforms" should be "multiple servers".

Oh, good.

If they're identically configured machines, as I think they probably are,
you can safely share the DB via an NFS mount or rsync or whatever...
-- 
Eric Eisenhart
NBLUG Co-Founder & Vice-President Pro Tempore
The North Bay Linux Users Group
http://nblug.org/
eric at nblug.org, IRC: Freiheit at freenode, AIM: falschfreiheit, ICQ: 48217244