Glen Pitt-Pladdy :: Blog

Bayesian Classifier Classes for Python

Previously I published Perl and PHP classes for Bayesian Classification, which has been a key component in a few projects of mine, including managing spamming attempts on this blog.

Since dspam seems not to have seen active development for a while, I needed a new classifier for mail. I've also been messing with pymilter as the basis of my own SMTP filtering. The obvious thing to do was to create a port of my previous Classifiers to Python. That way I also have 3-way comparison testing and database compatibility.

I've been running with this class for a while now and it seems to work as expected after a few things got ironed out. These were largely corner cases which didn't occur with previous projects, but caused my Milter to flake out (raise Exceptions) in all sorts of unexpected ways. Some of these include things like having Transaction Deadlocks when two mails are being processed concurrently and contain the same words. I've also ported these fixes to the original Perl and PHP versions.

Backend Database

This is exactly the same as with the Perl and PHP versions: databases are created with the SQL scripts supplied previously. In the case of MySQL we start by assuming an empty database with whatever user privileges already set:

$ mysql -u dbuser -pdbpass database < classifier_mysql.sql

And for SQLite:

$ sqlite3 /path/to/database.db < classifier_sqlite.sql

I have applications using both SQLite and MySQL, but obviously the former is a much easier starting point.

Livening up the class

The first step is to bring the class onboard. In Python:

import classifier

Then, for each lump of text you want to process, create a new object. In this example the text is read from STDIN:

import sys
classifier = classifier.classifier ( db, sys.stdin.read() )

Database connections are standard Python DB-API objects; in my case sqlite3 or MySQLdb.
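As a sketch of setting up that connection (the ':memory:' database and the MySQL credentials below are placeholders for illustration, not part of the class):

```python
import sqlite3

# SQLite: ':memory:' is used here purely for illustration - normally this
# would be the path to the database created from classifier_sqlite.sql
db = sqlite3.connect(':memory:')

# MySQL alternative (credentials are placeholders):
#   import MySQLdb
#   db = MySQLdb.connect(host='localhost', user='dbuser',
#                        passwd='dbpass', db='database')

# db is then what gets passed as the first argument to classifier.classifier()
```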


Newer versions have support for stopwords. You can get a list from one of the many places on the web and use that with the classifier to remove low-relevance words such as "the" which don't contribute a lot to the subject matter, normally resulting in better classification.

classifier.removestopwords ( stopwords )

This removes any occurrences of the stopwords from the text you loaded into the class.
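As a rough sketch of the effect (this helper is my own illustration of whole-word removal, not the class's actual implementation):

```python
import re

def remove_stopwords(text, stopwords):
    # Keep only words that are not in the stopword set
    # (whole-word, case-insensitive - an approximation of what
    # removestopwords() does to the loaded text)
    words = [w for w in re.findall(r"\w+", text.lower())
             if w not in stopwords]
    return ' '.join(words)

stopwords = {'the', 'a', 'of', 'and'}
print(remove_stopwords('The state of the blog', stopwords))  # state blog
```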

Teaching the classifier

As before, you need to teach the classifier about the different types of text you want to process. For this we use the classifier.teach() method. This takes up to 3 arguments:

  • classification (required) - positive integer
  • weight (optional) - defaults to 1 and allows the weight of training to be varied. eg. reduce lower-certainty samples
  • order (optional) - defaults to 1/true and enables word ordering (pairs of words) to be classified for potentially greater discrimination

You can use something like:

classifier.teach ( 5, 0.2, False )

That trains the text as classification 5 with a weight of 0.2 (20% of default) and does not train using word order (False).

Where the weighting can be used is if you want to auto-train on incoming mails. An inbound message that gets a strong classification (high level of confidence) could be used to train automatically with a low weight of say 0.1, so that it doesn't have a major impact if it's a false classification. Asymmetric training could be used to weight HAM higher than SPAM for safety, so that it will classify in favour of HAM.
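A sketch of such a policy - the class numbers, threshold and weights here are illustrative assumptions, not part of the class API:

```python
HAM, SPAM = 1, 2  # example classification numbers

def auto_train(cls, scores, threshold=0.9):
    # Only auto-train when the classifier is already confident,
    # and weight HAM more heavily than SPAM for safety
    if scores[HAM] > threshold:
        cls.teach(HAM, 0.2)
    elif scores[SPAM] > threshold:
        cls.teach(SPAM, 0.1)
```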

It is always worth having human input to avoid the system going unstable or being poisoned by spammers.

Trick: Forgetting training

As before, it may be useful to forget some previous training (eg. correcting a misclassification). The trick is simply to use a negative weight to reverse the previously learned value. Beware that once degrading (see later) has been applied, this can't be done without compensating for the degrades.
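A toy illustration of the arithmetic - this is a simplified model of the per-word counts, not the real database storage:

```python
counts = {}

def teach_word(word, classification, weight):
    # Simplified: teach() effectively adds the weight to the stored
    # count for each (word, classification) pair
    key = (word, classification)
    counts[key] = counts.get(key, 0.0) + weight

teach_word('offer', 5, 0.2)    # original (mis)training with weight 0.2
teach_word('offer', 5, -0.2)   # forget it again with the negated weight
print(counts[('offer', 5)])    # back to 0.0
```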

Classifying text

This time we use the classifier.classify() method, which takes up to two arguments:

  • classifications (required) - array of the different classifications (integers) you want to classify the text under
  • useorder (optional) - defaults to 0 and gives the proportion of the word order classification data you want to use. 0 = only word frequency, 1 = only word order

A typical example could be:

scores = classifier.classify ( [ 1, 2, 5, 8 ], 0.3 )

That will classify the text under classes 1, 2, 5 and 8 using a proportion of 0.3 (30%) of the word-ordering classification and the remaining 0.7 (70%) of the word-frequency classification.
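Assuming the returned scores map each classification number to its score (check the class for the exact return type - the values below are made up for illustration), picking the winning class is then:

```python
# Example score values, invented purely for illustration
scores = {1: 0.05, 2: 0.10, 5: 0.80, 8: 0.05}

# The highest-scoring classification wins
best = max(scores, key=scores.get)
print(best)  # 5
```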

About bias

When classifying, the frequency of the different classes is taken into account - ie. if almost everything we see is SPAM then the classifier is likely to classify new text as SPAM.... which has implications when it comes to false positives. To get it to treat all classes with even (unbiased) odds, set the "unbiased" flag:

classifier.unbiased = True

Other methods

As before, there are two more methods which can be useful. Degrading existing data so that newer training takes precedence (eg. done on a CRON job or via a maintenance URL that is periodically pinged on a site):

classifier.degrade ( 0.9 )

That weights the existing data by 0.9 each time it's called giving training a half-life of 1 week if it's called daily.
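A quick check of that half-life claim:

```python
# Degrading by 0.9 daily for a week leaves roughly half the weight
factor = 0.9 ** 7
print(round(factor, 3))  # 0.478 - close enough to a one-week half-life
```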

The other method is the word Quality score update. This always runs from the word whose Quality score was least recently updated. It has one optional argument, which is the number of words to process; otherwise it does them all:

classifier.updatequality ( 200 )

That will update 200 words each time it's run. This is ideal for slow background updating. The word quality is not used in the classifier currently, but it may be of interest for other types of classification.


I've bundled everything into one tarball to make it easier. As with the Perl and PHP versions, there is a command-line example that uses SQLite, along with SQL files for both MySQL and SQLite3 to create all the database tables needed.

Download: Python Bayesian Classifier Class & CLI tool on GitHub


