Glen Pitt-Pladdy :: Blog
Bayesian Classifier Classes for Python
Previously I published Perl and PHP classes for Bayesian Classification which was a key component in a few projects of mine, including managing spamming attempts on this blog.
Since dspam doesn't seem to have seen active development for a while, I needed a new classifier for mail. I've also been messing with pymilter as the basis of my own SMTP filtering. The obvious thing to do was to create a port of my previous Classifiers to Python. That way I also get 3-way comparison testing and database compatibility.
I've been running with this class for a while now and it seems to work as expected after a few things got ironed out. These were largely corner cases which didn't occur in previous projects, but caused my Milter to flake out (raise Exceptions) in all sorts of unexpected ways. Some of these involved Transaction Deadlocks when two mails containing the same words were processed concurrently. I've also ported these fixes to the original Perl and PHP versions.
Setting up the database
This is exactly the same as with the Perl and PHP versions: databases are created with the SQL scripts supplied in the tarball. In the case of MySQL we start by assuming an empty database with the user privileges already set:
$ mysql -u dbuser -pdbpass database < classifier_mysql.sql
For SQLite:
$ sqlite3 /path/to/database.db < classifier_sqlite.sql
I have applications using both SQLite and MySQL, but obviously the former is a much easier starting point.
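The class needs an open database handle to work against. A minimal sketch of obtaining one with the standard-library sqlite3 module; the real schema comes from the supplied classifier_sqlite.sql, so the in-memory database and toy table here are purely for illustration:

```python
import sqlite3

# Open a database handle to pass to the classifier. In practice you
# would connect to the file created with classifier_sqlite.sql; an
# in-memory database and a toy table stand in here for illustration.
db = sqlite3.connect(":memory:")
db.executescript("CREATE TABLE words (word TEXT PRIMARY KEY, count INTEGER);")
db.execute("INSERT INTO words VALUES (?, ?)", ("example", 1))
count = db.execute("SELECT count FROM words WHERE word = ?", ("example",)).fetchone()[0]
print(count)  # 1
```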
Livening up the class
The first step is to bring the class on board. In Python:
import classifier
Then for each lump of text you want to process, create a new object, in this example reading text from STDIN:
classifier = classifier.classifier ( db, sys.stdin.read() )
Newer versions have support for stopwords. You can get a list from one of the many places on the web and use that with the classifier to remove low-relevance words such as "the" which don't contribute a lot to the subject matter, normally resulting in better classification.
classifier.removestopwords ( stopwords )
This removes any occurrences of the stopwords from the text you loaded into the class.
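Conceptually this is just filtering the token list, something like the following plain-Python sketch (an illustration of the idea, not the class's actual implementation):

```python
# Illustrative sketch only: strip low-relevance words from tokenised
# text, as classifier.removestopwords() does to the loaded text.
stopwords = {"the", "a", "an", "and", "of"}

def remove_stopwords(words, stopwords):
    """Drop stopwords that contribute little to the subject matter."""
    return [w for w in words if w.lower() not in stopwords]

print(remove_stopwords(["The", "cat", "and", "the", "hat"], stopwords))
# ['cat', 'hat']
```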
Teaching the classifier
As before, you need to teach the classifier about the different types of text you want to process. For this we use the classifier.teach() method, which takes up to 3 arguments:
You can use something like:
classifier.teach ( 5, 0.2, False )
That trains the text as classification 5 with a weight of 0.2 (20% of the default) and does not train on word order (False).
Where the weighting can be used is if you want to auto-train on incoming mail. An inbound message that gets a strong classification (high level of confidence) could be used to train automatically with a low weight of, say, 0.1 so that it doesn't have a major impact if the classification turns out to be false. Asymmetric training could be used to weight HAM higher than SPAM for safety, so that borderline mail classifies in favour of HAM.
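That auto-training policy could be sketched like this; the function name and thresholds are hypothetical, not part of the class:

```python
# Hypothetical policy sketch for the auto-training described above:
# confident classifications get retrained with a low weight, with HAM
# weighted higher than SPAM for safety. Thresholds are illustrative.
def autotrain_weight(is_spam, confidence):
    """Return a low training weight for a confident classification,
    or None to leave the message for a human to classify."""
    if confidence < 0.9:  # not confident enough to risk auto-training
        return None
    return 0.05 if is_spam else 0.1  # HAM trained at double the weight

print(autotrain_weight(False, 0.95))  # 0.1
print(autotrain_weight(True, 0.95))   # 0.05
print(autotrain_weight(True, 0.5))    # None
```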
It is always worth having human input to avoid the system going unstable or being poisoned by spammers.
Trick: Forgetting training
As before it may be useful to forget some previous training (eg. correcting a misclassification). The trick is simply to use a negative weight to reverse the previously learned value. Beware that this can't be done once degrading (see later) has been done without compensating for the degrades.
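As a toy illustration of the reversal, assume training adds weight multiplied by the word occurrences to stored totals (an assumption about the internals, for illustration only):

```python
# Toy model: assume each teach() adds weight per word occurrence to a
# stored total (an assumption about the internals, illustration only).
counts = {}

def teach(words, weight):
    for w in words:
        counts[w] = counts.get(w, 0) + weight

teach(["cheap", "pills"], 1.0)   # mistaken training
teach(["cheap", "pills"], -1.0)  # forget it again with a negative weight
print(counts)  # {'cheap': 0.0, 'pills': 0.0}
```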
Classifying the text
This time we use the classifier.classify() method, which takes up to two arguments:
A typical example could be:
scores = classifier.classify ( [ 1, 2, 5, 8 ], 0.3 )
This will classify the text against classes 1, 2, 5 and 8, with a proportion of 0.3 (30%) of the score coming from the word-ordering classification and the remaining 0.7 (70%) from the word-frequency classification.
When classifying, the frequency of the different classes is taken into account - i.e. if almost all the mail we see is SPAM then the classifier is more likely to classify new text as SPAM, which has implications when it comes to false positives. To get it to treat all classes with even (unbiased) odds, set the "unbiased" flag:
classifier.unbiased = True
As before, there are two more methods which can be useful. The first degrades existing data so that newer training takes precedence (eg. run from a CRON job or via a maintenance URL that is periodically pinged on a site):
classifier.degrade ( 0.9 )
That weights the existing data by 0.9 each time it's called, giving training a half-life of about 1 week if it's called daily.
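The half-life claim is easy to check: daily degrading by 0.9 leaves 0.9^7 of the weight after a week, which is roughly half.

```python
# Daily degrade by 0.9: after 7 days existing training retains 0.9**7
# of its weight, i.e. roughly half, hence the ~1 week half-life.
retained = 0.9 ** 7
print(round(retained, 3))  # 0.478
```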
The other method is the word Quality score update. This always runs from the word that has gone longest since its Quality score was last updated. It takes one optional argument, the number of words to process; otherwise it does them all:
classifier.updatequality ( 200 )
That will update 200 words each time it's run. This is ideal for slow background updating. The word quality is not used in the classifier currently, but it may be of interest for other types of classification.
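The oldest-first ordering amounts to sorting words by when their Quality score was last touched and taking a batch; a plain-Python sketch with hypothetical data (not the class's actual storage):

```python
# Sketch of oldest-first batch selection: pick the n words whose
# quality score was updated longest ago. Timestamps are fake data.
last_updated = {"spam": 100, "ham": 50, "offer": 10}

def batch_oldest(last_updated, n):
    """Return the n words with the oldest last-update timestamps."""
    return sorted(last_updated, key=last_updated.get)[:n]

print(batch_oldest(last_updated, 2))  # ['offer', 'ham']
```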
I've bundled everything into one tarball to make it easier. There is also a Perl and PHP command-line example that uses SQLite. There are also SQL files for both MySQL and SQLite3 to create all the database tables needed.
Copyright Glen Pitt-Pladdy 2008-2017