Glen Pitt-Pladdy :: Blog
Bayesian Classifier Classes for Python
Previously I published Perl and PHP classes for Bayesian Classification which was a key component in a few projects of mine, including managing spamming attempts on this blog.
Since dspam doesn't seem to have seen active development for a while, I needed a new classifier for mail. I've also been messing with pymilter as the basis of my own SMTP filtering. The obvious thing to do was to create a port of my previous Classifiers to Python. That way I also get 3-way comparison testing and database compatibility.
I've been running with this class for a while now and it seems to work as expected after a few things got ironed out. These were largely corner cases which didn't occur in previous projects, but caused my Milter to flake out (raise Exceptions) in all sorts of unexpected ways. Some of these involved Transaction Deadlocks when two mails containing the same words were processed concurrently. I've also ported these fixes to the original Perl and PHP versions.
Setting up the database
This is exactly the same as with the Perl and PHP versions: databases are created with the SQL scripts supplied in the tarball. In the case of MySQL we start by assuming an empty database with the user privileges already set:
$ mysql -u dbuser -pdbpass database < classifier_mysql.sql
For SQLite:
$ sqlite3 /path/to/database.db < classifier_sqlite.sql
I have applications using both SQLite and MySQL, but obviously the former is a much easier starting point.
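The class needs an open database handle to work against. A minimal sketch of obtaining one with the standard-library sqlite3 module; the real schema comes from the supplied classifier_sqlite.sql, so the in-memory database and toy table here are purely for illustration:

```python
import sqlite3

# Open a database handle to pass to the classifier. In practice you
# would connect to the file created with classifier_sqlite.sql; an
# in-memory database and a toy table stand in here for illustration.
db = sqlite3.connect(":memory:")
db.executescript("CREATE TABLE words (word TEXT PRIMARY KEY, count INTEGER);")
db.execute("INSERT INTO words VALUES (?, ?)", ("example", 1))
count = db.execute("SELECT count FROM words WHERE word = ?", ("example",)).fetchone()[0]
print(count)  # 1
```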
Livening up the class
The first step is to bring the class on board. In Python:
import classifier
Then for each lump of text you want to process, create a new object, in this example reading text from STDIN:
classifier = classifier.classifier ( db, sys.stdin.read() )
Newer versions have support for stopwords. You can get a list from one of the many places on the web and use that with the classifier to remove low-relevance words such as "the" which don't contribute a lot to the subject matter, normally resulting in better classification.
classifier.removestopwords ( stopwords )
This removes any occurrences of the stopwords from the text you loaded into the class.
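Conceptually this is just filtering the token list, something like the following plain-Python sketch (an illustration of the idea, not the class's actual implementation):

```python
# Illustrative sketch only: strip low-relevance words from tokenised
# text, as classifier.removestopwords() does to the loaded text.
stopwords = {"the", "a", "an", "and", "of"}

def remove_stopwords(words, stopwords):
    """Drop stopwords that contribute little to the subject matter."""
    return [w for w in words if w.lower() not in stopwords]

print(remove_stopwords(["The", "cat", "and", "the", "hat"], stopwords))
# ['cat', 'hat']
```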
Teaching the classifier
As before, you need to teach the classifier about the different types of text you want to process. For this we use the classifier.teach() method, which takes up to 3 arguments:
You can use something like:
classifier.teach ( 5, 0.2, False )
That trains the text as classification 5 with a weight of 0.2 (20% of the default) and does not train on word order (False).
Where the weighting can be used is if you want to auto-train on incoming mail. An inbound message that gets a strong classification (high level of confidence) could be used to train automatically with a low weight of, say, 0.1 so that it doesn't have a major impact if the classification turns out to be false. Asymmetric training could be used to weight HAM higher than SPAM for safety, so that borderline mail classifies in favour of HAM.
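That auto-training policy could be sketched like this; the function name and thresholds are hypothetical, not part of the class:

```python
# Hypothetical policy sketch for the auto-training described above:
# confident classifications get retrained with a low weight, with HAM
# weighted higher than SPAM for safety. Thresholds are illustrative.
def autotrain_weight(is_spam, confidence):
    """Return a low training weight for a confident classification,
    or None to leave the message for a human to classify."""
    if confidence < 0.9:  # not confident enough to risk auto-training
        return None
    return 0.05 if is_spam else 0.1  # HAM trained at double the weight

print(autotrain_weight(False, 0.95))  # 0.1
print(autotrain_weight(True, 0.95))   # 0.05
print(autotrain_weight(True, 0.5))    # None
```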
It is always worth having human input to avoid the system going unstable or being poisoned by spammers.
Trick: Forgetting training
As before it may be useful to forget some previous training (eg. correcting a misclassification). The trick is simply to use a negative weight to reverse the previously learned value. Beware that this can't be done once degrading (see later) has been done without compensating for the degrades.
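As a toy illustration of the reversal, assume training adds weight multiplied by the word occurrences to stored totals (an assumption about the internals, for illustration only):

```python
# Toy model: assume each teach() adds weight per word occurrence to a
# stored total (an assumption about the internals, illustration only).
counts = {}

def teach(words, weight):
    for w in words:
        counts[w] = counts.get(w, 0) + weight

teach(["cheap", "pills"], 1.0)   # mistaken training
teach(["cheap", "pills"], -1.0)  # forget it again with a negative weight
print(counts)  # {'cheap': 0.0, 'pills': 0.0}
```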
Classifying the text
This time we use the classifier.classify() method, which takes up to two arguments:
A typical example could be:
scores = classifier.classify ( [ 1, 2, 5, 8 ], 0.3 )
This will classify the text against classes 1, 2, 5 and 8, with a proportion of 0.3 (30%) of the score coming from the word-ordering classification and the remaining 0.7 (70%) from the word-frequency classification.
When classifying, the frequency of the different classes is taken into account - i.e. if almost all the mail we see is SPAM then the classifier is more likely to classify new text as SPAM, which has implications when it comes to false positives. To get it to treat all classes with even (unbiased) odds, set the "unbiased" flag:
classifier.unbiased = True
As before, there are two more methods which can be useful. The first degrades existing data so that newer training takes precedence (eg. run from a CRON job or via a maintenance URL that is periodically pinged on a site):
classifier.degrade ( 0.9 )
That weights the existing data by 0.9 each time it's called, giving training a half-life of about 1 week if it's called daily.
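The half-life claim is easy to check: daily degrading by 0.9 leaves 0.9^7 of the weight after a week, which is roughly half.

```python
# Daily degrade by 0.9: after 7 days existing training retains 0.9**7
# of its weight, i.e. roughly half, hence the ~1 week half-life.
retained = 0.9 ** 7
print(round(retained, 3))  # 0.478
```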
The other method is the word Quality score update. This always runs from the word that has gone longest since its Quality score was last updated. It takes one optional argument, the number of words to process; otherwise it does them all:
classifier.updatequality ( 200 )
That will update 200 words each time it's run. This is ideal for slow background updating. The word quality is not used in the classifier currently, but it may be of interest for other types of classification.
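The oldest-first ordering amounts to sorting words by when their Quality score was last touched and taking a batch; a plain-Python sketch with hypothetical data (not the class's actual storage):

```python
# Sketch of oldest-first batch selection: pick the n words whose
# quality score was updated longest ago. Timestamps are fake data.
last_updated = {"spam": 100, "ham": 50, "offer": 10}

def batch_oldest(last_updated, n):
    """Return the n words with the oldest last-update timestamps."""
    return sorted(last_updated, key=last_updated.get)[:n]

print(batch_oldest(last_updated, 2))  # ['offer', 'ham']
```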
I've bundled everything into one tarball to make it easier. There is also a Perl and PHP command-line example that uses SQLite. There are also SQL files for both MySQL and SQLite3 to create all the database tables needed.
Copyright Glen Pitt-Pladdy 2008-2017