Short: Junk E-Mail Classifier.
Author: agmsmith@ncf.ca (Alexander G. M. Smith)
Uploader: agmsmith@ncf.ca (Alexander G. M. Smith)
Website: http://web.ncf.ca/au829/index.html
Version: 1.73
Type: internet & network/e-mail
Requires: BeOS 5.0+
Related things: http://www.paulgraham.com/spam.html, http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
AGMSBayesianSpam is a set of BeOS programs for classifying e-mail messages and other text as either spam or genuine. "Spam" is the colloquial name for unwanted junk messages, usually advertising. The name comes from a 1970's Monty Python comedy skit involving lots of unwanted Spam, which is the name for the spicy ham in a can made by the Hormel Foods company, originally from Austin, Minnesota, USA. The program classifies messages as spam or genuine (sometimes called "ham"), based on the words they contain and previous messages which have been identified by the user as spam or genuine. It's implemented as a server program (AGMSBayesianSpamServer) which keeps track of the word list and a Mail Daemon Replacement add-on (AGMSBayesianSpamFilter) which uses the server to classify incoming messages. Theoretically other programs, like a news reader, could also use the word database using the scripting interface. There's also a command line interface and a graphical user interface.
If you want to know more about the technique of counting words, have a look at Paul Graham's wonderful write-up at http://www.paulgraham.com/spam.html. This program is currently using an improved version of Graham's method, called Gary-combining, by Gary Robinson. See http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html for Gary's story. There's also an even more improved method called Chi-Squared (χ²) combining, which grew from discussions on the Spambayes mailing list.
When you're done, it should look something like this:
Run the AGMSBayesianSpamServer program again. This time it shouldn't complain. Click the "Create" button to make a new database with the default name of "/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam Database".
Use the "Add Example of Spam/Genuine" button, and only select at most
80 files at a time (otherwise the Tracker/File Requester will lock up and
you'll have to reboot your computer). It will ask you to identify each
file as spam or genuine, you also have the choice of identifying a whole
batch of them as all spam or all genuine.
You can also drag and drop example messages into the bottom half of
the window. Drop in the left side for genuine, right side for spam, but
avoid the middle third of the window.
If you have thousands of messages, use the command line mode.
Check for e-mail as usual. If you look at the inbox directory in Tracker, you can add an extra column with the E-mail attribute "Spam/Genuine Estimate" to see how spammy the messages are. 0.0 means the system thinks the message is fully genuine, 1.0 fully spam. But it can be wrong, for things like a friend of yours quoting a spam message. For the Chi-squared method (the default), you see numbers close to zero for genuine (like 9.750e-13), close to 1 for spam and in-between (0.01 to 0.99) if it can't decide. With the Robinson scoring method, usually if it is over 0.56 (the best cutoff value depends a bit on your database quality, but 0.56 is typical) then it is spam, and the closer it is to 1.0 the more likely it really is spam.
I sort by spam ratio, and manually throw away the messages that are spammy, then I switch the Tracker window back to sorting by thread+date (just a click on the appropriate column title does it) and get on with reading the mail.
If you turned on the filter option to modify the subject, you'll see spam messages with something like [Spam 95%] in front of the subject (I don't use it because it looks ugly). But only in the Tracker display of the Subject, the actual subject inside the message isn't affected, just the MAIL:subject attribute, which is what the Tracker shows.
The accuracy is only as good as your database, so update it with
more example spam and genuine messages. In particular, if it gets the estimate
wrong, add that message to the database to tell it what it should be doing. A
quick way to do that is to right click on the e-mail in Tracker, and pick Open
With... AGMSBayesianSpamServer.
It should start up and ask you if the message is spam or genuine.
You can also drag and drop the message into the left third of the word list for
genuine messages, or right third for spam messages. Dropping in the middle
third does something else that's mostly harmless and fun.
You may also want to train it with all your messages (it gives slightly better results in the long run than just training on the mistakes). To make it easier, turn on the self-training option in the mail filter. It will compute the spam ratio of new mail messages, then feed back the same message into the database as an example of spam/genuine. When it gets it wrong, you should manually retrain it with the correct classification, otherwise the database will get worse and worse and finally turn into mush.
If you're annoyed by the server window popping up whenver the system checks
for e-mail, you can tell it to hide. Just click the "Server Mode" checkbox.
Actually, that's now the default since people were complaining about the window
getting in the way. The disadvantage is that you don't get to see error
messages. To make it visible again, start up AGMSBayesianSpamServer (possibly
by double clicking on its icon in /boot/home/config/bin/ and bring up the
hidden window by using the deskbar, or by using the "Edit Server Settings"
button in the spam filter configuration).
Besides the graphical user interface, there is also a command line mode. Just type "AGMSBayesianSpamServer help" in the terminal to get a list of the commands and what they do (the ultimate documentation). It also explains all of the mysterious options you see in the graphical user interface. The same commands can be used in scripting, either from some other program or via the "hey" utility which you can get from http://www.bebits.com/app/2042. A useful command, if you have a lot of spam messages to add, is "AGMSBayesianSpamServer set genuine *" which will use all messages in the current directory as examples of genuine text.
Fri Dec 13 15:57:22 568 /tmp>AGMSBayesianSpamServer help AGMSBayesianSpamServer - A Spam Database Server Copyright © 2002 by Alexander G. M. Smith. Released to the public domain. Compiled on Dec 13 2002 at 15:56:20. $Revision: 1.7 $ $Header: /cvsroot/bemaildaemon/AGMSBayesianSpamServer/AGMSBayesianSpamServer.cpp,v 1.73 2002/12/13 20:55:57 agmsmith Exp $ This is a program for classifying e-mail messages as spam (junk mail which you don't want to read) and regular genuine messages. It can learn what's spam and what's genuine. You just give it a bunch of spam messages and a bunch of non-spam ones. It uses them to make a list of the words from the messages with the probability that each word is from a spam message or from a genuine message. Later on, it can use those probabilities to classify new messages as spam or not spam. If the classifier stops working well (because the spammers have changed their writing style and vocabulary, or your regular correspondants are writing like spammers), you can use this program to update the list of words to identify the new messages correctly. The original idea was from Paul Graham's algorithm, which has an excellent writeup at: http://www.paulgraham.com/spam.html Gary Robinson came up with the improved algorithm, which you can read about at: http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html Then he, Tim Peters and the SpamBayes mailing list developed the Chi-Squared test, see http://mail.python.org/pipermail/spambayes/2002-October/001036.html for one of the earlier messages leading from the central limit theorem to the current chi-squared scoring method. Thanks go to Isaac Yonemoto for providing a better icon. Usage: Specify the operation as the first argument followed by more information as appropriate. The program's configuration will affect the actual operation (things like the name of the database file to use, or whether it should allow non-email messages to be added). In command line mode it will do the operation and exit. In GUI/server mode a command line invocation will just send the command to the running server. You can also use BeOS scripting (see the "Hey" command which you can get from http://www.bebits.com/app/2042 ) to control the Spam server. And finally, there's also a GUI interface which shows up if you start it without any command line arguments. Commands: Quit Stop the program. Useful if it's running as a server. Get DatabaseFile Get the pathname of the current database file. The default name is something like B_USER_SETTINGS_DIRECTORY / AGMSBayesianSpam / AGMSBayesianSpamServer Database Set DatabaseFile NewValue Change the pathname of the database file to use. It will automatically be converted to an absolute path name, so make sure the parent directories exist before setting it. If it doesn't exist, you'll have to use the create command next. Create DatabaseFile Creates a new empty database, will replace the existing database file too. Delete DatabaseFile Deletes the database file and all backup copies of that file too. Really only of use for uninstallers. Count DatabaseFile Returns the number of words in the database. Set Spam NewValue Adds the spam in the given file (specify full pathname to be safe) to the database. The words in the files will be added to the list of words in the database that identify spam messages. The files processed will also have the attribute MAIL:classification added with a value of "Spam" or "Genuine" as specified. They also have their spam ratio attribute updated, as if you had also used the Evaluate command on them. If they already have the MAIL:classification attribute and it matches the new classification then they won't get processed (and if it is different, they will get removed from the statistics for the old class and added to the statistics for the new one). You can turn off that behaviour with the IgnorePreviousClassification property. The command line version lets you specify more than one pathname. Count Spam Returns the number of spam messages in the database. Set SpamString NewValue Adds the spam in the given string (assumed to be the text of a whole e-mail message, not just a file name) to the database. Set Genuine NewValue Similar to adding spam except that the message file is added to the genuine statistics. Count Genuine Returns the number of genuine messages in the database. Set GenuineString NewValue Adds the genuine message in the given string (assumed to be the text of a whole e-mail message, not just a file name) to the database. Set IgnorePreviousClassification NewValue If set to true then the previous classification (which was saved as an attribute of the e-mail message file) will be ignored, so that you can add the message to the database again. If set to false (the normal case), the attribute will be examined, and if the message has already been classified as what you claim it is, nothing will be done. If it was misclassified, then the message will be removed from the statistics for the old class and added to the stats for the new classification you have requested. Get IgnorePreviousClassification Find out the current setting of the flag for ignoring the previously recorded classification. Set ServerMode NewValue If set to true then error messages get printed to the standard error stream rather than showing up in an alert box. It also starts up with the window minimized. Get ServerMode Find out the setting of the server mode flag. Flush Writes out the database file to disk, if it has been updated in memory but hasn't been saved to disk. It will automatically get written when the program exits, so this command is mostly useful for server mode. Set PurgeAge NewValue Sets the old age limit. Words which haven't been updated since this many message additions to the database may be deleted when you do a purge. A good value is 1000, meaning that if a word hasn't appeared in the last 1000 spam/genuine messages, it will be forgotten. Zero will purge all words, 1 will purge words not in the last message added to the database, 2 will purge words not in the last two messages added, and so on. This is mostly useful for removing those one time words which are often hunks of binary garbage, not real words. This acts in combination with the popularity limit; both conditions have to be valid before the word gets deleted. Get PurgeAge Gets the old age limit. Set PurgePopularity NewValue Sets the popularity limit. Words which aren't this popular may be deleted when you do a purge. A good value is 5, which means that the word is safe from purging if it has been seen in 6 or more e-mail messages. If it's only in 5 or less, then it may get purged. The extreme is zero, where only words that haven't been seen in any message are deleted (usually means no words). This acts in combination with the old age limit; both conditions have to be valid before the word gets deleted. Get PurgePopularity Gets the purge popularity limit. Purge Purges the old obsolete words from the database, if they are old enough according to the age limit and also unpopular enough according to the popularity limit. Get Oldest Gets the age of the oldest message in the database. It's relative to the beginning of time, so you need to do (total messages - age - 1) to see how many messages ago it was added. Set Evaluate NewValue Evaluates a given file (by path name) to see if it is spam or not. Returns the ratio of spam probability vs genuine probability, 0.0 meaning completely genuine, 1.0 for completely spam. Normally you should safely be able to consider it as spam if it is over 0.56 for the Robinson scoring method. For the ChiSquared method, the numbers are near 0 for genuine, near 1 for spam, and anywhere in the middle means it can't decide. The program attaches a MAIL:ratio_spam attribute with the ratio as its float32 value to the file. Also returns the top few interesting words in "words" and the associated per-word probability ratios in "ratios". Set EvaluateString NewValue Like Evaluate, but rather than a file name, the string argument contains the entire text of the message to be evaluated. ResetToDefaults Resets all the configuration options to the default values, including the database name. InstallThings Creates indices for the MAIL:classification and MAIL:ratio_spam attributes on all volumes which support BeOS queries, identifies them to the system as e-mail related attributes (modifies the text/x-email MIME type), and sets up the new MIME type (text/x-vnd.agmsmith.spam_probability_database) for the database file. Also registers names for the sound effects used by the separate filter program (use the installsound BeOS program or the Sounds preferences program to associate sound files with the names). Set TokenizeMode NewValue Sets the method used for breaking up the message into words. Use "Whole" for the whole file (also use it for non-email files). The file isn't broken into parts; the whole thing is converted into words, headers and attachments are just more raw data. Well, not quite raw data since it converts quoted-printable codes (equals sign followed by hex digits or end of line) to the equivalent single characters. "PlainText" breaks the file into MIME components and only looks at the ones which are of MIME type text/plain. "AnyText" will look for words in all text/* things, including text/html attachments. "AllParts" will decode all message components and look for words in them, including binary attachments. "JustHeader" will only look for words in the message header. "AllPartsAndHeader", "PlainTextAndHeader" and "AnyTextAndHeader" will also include the words from the message headers. Get TokenizeMode Gets the method used for breaking up the message into words. Set ScoringMode NewValue Sets the method used for combining the probabilities of individual words into an overall score. "Robinson" mode will use Gary Robinson's nth root of the product method. It gives a nice range of values between 0 and 1 so you can see shades of spaminess. The cutoff point between spam and genuine varies depending on your database of words (0.56 was one point in some experiments). "ChiSquared" mode will use chi-squared statistics to evaluate the difference in probabilities that the lists of word ratios are random. The result is very close to 0 for genuine and very close to 1 for spam, and near the middle if it is uncertain. Get ScoringMode Gets the method used for combining the individual word ratios into an overall score. ProcessArgs: The property specified isn't known or doesn't support the requested action (usually means it is an unknown command), error code $FFFFFFFF/-1 (General OS error) has occured. AGMSBayesianSpamServer shutting down... Fri Dec 13 15:57:25 569 /tmp>
Another advanced trick is to load the list of words into Gobe Productive's spreadsheet, so that you can find the most popular word or chart the word frequencies. Unfortunately it can only handle about 16000 words. To do that, start up Gobe Productive, pick Open, then from the file requester's "Document Type" menu, pick "Spreadsheet" and then in the submenu pick "Tab-delimited text". Then navigate to the database, the default location is "/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam Database". Have fun!
The word display tells you more than you need to know about the words in the database. Those colour bars actually mean something.
Obviously words which are more genuine than spamish show up in blue, while spammier words are in red. It's proportionally based on the total message counts so that a word which shows up in 10% of the genuine messages and 9% of the spam will show up in blue, even if it was in more spam messages than genuine messages (this compensates a bit for not training on an equal number of spam and genuine messages). The length of the bar shows the ratio of the proportions; further to the left for larger genuine proportions, and similarly further right for larger spam proportions.
The thickness of the bar shows how many messages the word was found in. It's kind of a weight, saying how frequently used that word is and thus how significant it is.
The paleness of the bar shows you how old that word is. A light colour means that the word was last added to the database long ago. A darker, more saturated colour means that the word was added more recently, when you added example messages to the database.
Finally, if you click on the word display, the background will change from a
pale blue tint into solid white, to show you that it is the active keyboard
focus. That means you can type in letters to find a particular word (delay for
one second to start typing the letters for a new word). The arrow keys, page
up/down keys and the mouse scroll wheel also show you different words. Sorry,
there's no scroll bar since finding the Nth word is a slow operation with a set
of words (they aren't numbered); each twitch of the scroll bar would mean going
through the list of tens or even hundreds of thousands of words and counting to
find the scroll position.
I did some tests with tokenizing different parts of mail messages to see what would work best.
The Database:
341 training genuine messages, 406 training spam messages (or 398 when
parsing due to a bug (fixed later on in 2.0.0b5) with messages that don't have
body text).
40 test genuine messages, 40 test spam messages, all more recent than the
training ones.
Spam threshold is 0.56, Gary-combining method.
The results:
Tokenizing Method | Genuine Test Details | Genuine Accuracy | Spam Test Details | Spam Accuracy |
---|---|---|---|---|
Just headers | Genuine .181352 to .557881, one false positive (a mailbox full announcement). | 2.5% wrong. | Spam .450602 to .750511, 21 false negatives. | 52.5% wrong. |
Whole raw message text | Genuine .163027 to .627022, 3 false positives. | 7.5% wrong. | Spam .509355 to .993985, 1 false negative. | 2.5% wrong. |
Message parsed into parts plus header | Genuine .168857 to .609005, 4 false positives. | 10% wrong. | Spam .614564 to .994364, 0 false negatives. | 0% wrong. |
Message parsed into parts, no header data | Genuine .220161 to .631161, 5 false positives. | 12.5% wrong. | Spam .592501 to .994444, 0 false negatives. | 0% wrong. |
Any text parts and header | Genuine .162697 to .614136, 4 false positives. | 10% wrong. | Spam .614973 to .994362, 0 false negatives. | 0% wrong. |
Any text parts, no headers | Genuine .221923 to .635487, 6 false positives. | 15% wrong. | Spam .594271 to .994441, 0 false negatives. | 0% wrong. |
text/plain parts (including body text) | Genuine .137869 to .583192, 3 false positives. | 7.5% wrong. | Spam .448059 to .994119, 17 false negatives. | 42.5% wrong. |
Only text/plain sub-parts, no headers. 150 spam and 1 genuine training message had no words! | Genuine .219169 to .696899, 9 false positives. | 22.5% wrong. | Spam .660755 to .994116, 0 false negatives, 27 had no words. | 0% wrong. |
The results look good for the whole message tokenizing method (which also works on non-email files) and for the all text parts plus header. Since the text parts method doesn't add lots of garbage words to the database from trying to find words in binary attachments, it's now the default setting.
The header only method is pretty good too for identifying genuine messages, and so-so for spam messages. That may make it useable for pre-download tests (delete some of the spam on the mail server before downloading it, without worrying about deleting too many genuine messages).
If you have a slow dial-up connection, you may wish to classify your mail quickly by deleting spam messages without downloading the entire junk message.
This can be done with three settings. First switch the AGMSBayesianSpamServer into tokenizing just the headers. Then go into the E-mail preferences and add an AGMSBayeisianSpamFilter with the "Add [Spam %] in Front of Subject" option turned on, and the ratio set to a nice safe high level like 0.95 (so that your genuine mail is less likely to get deleted, but it will still delete the 1% of your real mail that looks like spam, which is why this is dangerous). Finally in the E-mail preferences, add a "Match Header" filter after the spam filter and set it so that If Subject is \[Spam* then Delete Message. That's backslash, left square bracket, Spam with the S capitalised, asterix. Now it will download the headers, check them against the spam database, and then delete the spam ones on the server without downloading the rest of their contents.
You should also make a new spam database trained in Just Headers tokenizing mode with roughly equal examples of your genuine messages and spam messages (50 of each should be enough to start). A full message database may also work, but headers only training should be more accurate for headers only decisions. When testing JustHeader mode, I noticed that the false positive rate (genuine reported as spam) is nice and low, but the false negative rate (spam reported as genuine) is high (tested with Robinson scoring, not Chi-Squared scoring). So this means JustHeader mode will delete maybe half the spam (and download the rest) and also delete the occasional genuine message.
The various versions released to the public. These are actually several accumulated minor changes, which you can see by looking at the log in the top of the source code files.
Released to the public domain in 2002 by the author, Alexander G. M. Smith.