Glen Pitt-Pladdy :: Blog

Bayesian Classifier Classes for Perl and PHP

I've been doing a load of updates to this blog platform recently, and one of the things I've been trying to do is get a better classifier for comment SPAM. Currently there are several layers, including reCAPTCHA and various additional checks and combinations of parameters, and well over 90% of comments were already handled automatically. reCAPTCHA has become less effective of late, with what is clearly automated or semi-automated completion of the comment form.

I'm also working on an automated text classification problem for another site, and while experimenting with classifiers I decided to see what I could accomplish with Bayesian classification. This has proven very effective for SPAM detection in mail, and project management / support tools like FogBugz use it for automatic sorting of email.

I had a search around for ready-to-go PHP classes I could drop in and drew a blank. There are lots of Bayesian classifier classes, but I wanted one that I could just drop in and that would use existing databases for storage. That proved difficult, and eventually I decided it would be quicker to write my own than to trawl through pages, dead links and unsuitable classes.

The class proved so effective I decided to do a Perl port of the class allowing universal access. I also added support for SQLite3 which is useful for non-web applications.

I've tried to keep things very generic and the class accepts an arbitrary number of classifications so it could be used for intelligent mail sorting, SPAM detection, automatically classifying TV programmes (an experiment I'm busy doing) and much more.

It also has features to monitor the value of words for classification and to let old classification data decay so that recent training takes precedence.

If you are just looking for a command line tool then something like dbacl will probably save you a lot of hassle. The point of these classes is that you can build them into your own tools, with database storage.

Update: Python

Since I keep using this basic building block for so many new things I have ended up also creating a Python version of the Bayesian Classifier.

Backend Database

The classes use DBI (Perl) or PDO (PHP) and need very little else. The only thing you need to worry about is setting up the database, which can be done with the SQL scripts supplied. In the case of MySQL we start by assuming an empty database with the required user privileges already set:

$ mysql -u dbuser -pdbpass database < classifier_mysql.sql

or...

$ sqlite3 /path/to/database.db < classifier_sqlite.sql

SQLite is certainly a good place to start as it's easy to copy the database for experimenting. You could even check the database into a repo as you go.

Livening up the class

First step is to bring the class onboard. In Perl:

use classifier;

or in PHP:

require_once ( 'classifier.php' );

Then create a new object for each lump of text you want to process. In Perl:

my $classifier = classifier->new ( $dbh, join ( '', <STDIN> ) );

or in PHP:

$classifier = new classifier ( $dbh, ...... );

The Perl example reads the text from STDIN; in the PHP example the dots stand in for the text to process. Both are given the database handle $dbh, which must be a DBI handle in Perl or a PDO handle in PHP.

Stopwords

Newer versions have support for stopwords. You can get a list from one of the many places on the web and use that with the classifier to remove low-relevance words such as "the" which don't contribute a lot to the subject matter. This generally improves the effectiveness by increasing the proportion of relevant words.

In Perl:

$classifier->removestopwords ( @stopwords );

In PHP:

$classifier->removestopwords ( $stopwords );

These simply remove any occurrences of the stopwords from the text you loaded into the class.
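As a rough illustration of what stopword removal does to a piece of text, here is a minimal Python sketch. It is not the code from the classes: the function name remove_stopwords, the tokenising regular expression and the sample stopword list are all assumptions made for this example.

```python
import re

def remove_stopwords(text, stopwords):
    """Return the lower-cased words of text with any stopwords dropped.

    Hypothetical helper: the real classes may tokenise differently.
    """
    stop = {word.lower() for word in stopwords}
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in stop]

# A tiny made-up stopword list; real lists run to hundreds of words.
print(remove_stopwords("The cat sat on the mat", ["the", "on", "a"]))
# ['cat', 'sat', 'mat']
```

With the filler words gone, a larger share of what is left says something about the subject matter.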

Teaching the classifier

First off you need to teach the classifier about the different types of text you want to process. For this we use the classifier::teach() method, which takes up to 3 arguments:

  • classification (required) - positive integer
  • weight (optional) - defaults to 1 and allows the weight of training to be varied, eg. reduced for lower-certainty samples
  • order (optional) - defaults to 1/true and enables word ordering (pairs of words) to be classified for potentially greater discrimination

With Perl you can use something like:

$classifier->teach  ( 5, 0.2, 0 );

or in PHP:

$classifier->teach  ( 5, 0.2, false );

Those train the text as classification 5 with a weight of 0.2 (20% of the default) and do not train using word order.

Where the weighting is useful is if you want to auto-train on new material. For example, if an inbound message gets a SPAM score >0.9 we may automatically train the classifier on it, but only with a weight of 0.1 so that it doesn't have a major impact if it's a false classification. Then for a SPAM score <0.1 (ie. HAM / not-SPAM) we could auto-train with a weight of 0.2. That way the balance is tipped towards safety (not discarding messages).

It is always worth having human input to avoid the system going unstable.
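The auto-training policy described above could be sketched in Python along these lines. The thresholds and weights are the ones from the example; the function name auto_train_decision and the class numbers 1 (SPAM) and 2 (HAM) are assumptions for illustration only.

```python
SPAM, HAM = 1, 2  # hypothetical classification numbers

def auto_train_decision(spam_score):
    """Return (classification, weight) to auto-train with, or None.

    None means the score is not decisive enough and the message
    should be left for a human to classify.
    """
    if spam_score > 0.9:
        return (SPAM, 0.1)  # confident SPAM: train lightly
    if spam_score < 0.1:
        return (HAM, 0.2)   # confident HAM: train a little harder
    return None

print(auto_train_decision(0.97))  # (1, 0.1)
print(auto_train_decision(0.03))  # (2, 0.2)
print(auto_train_decision(0.50))  # None
```

The returned pair would then be handed to whatever does the actual training, for example as the first two arguments of teach().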

Trick: Forgetting training

In some cases it may be useful to forget some previous training. An example of this happens with this blog: when a comment is posted and it is scored sufficiently strongly as either SPAM or HAM, the classifier will automatically train on that message. In the rare cases where it gets it wrong and a manual override is used, we need to forget the previous training before we re-train on the correct classification.

The trick is simply to use a negative weight to reverse the previously learned value. Beware that this can't be done once degrading (see later) has been done without compensating for the degrades.
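To see why a negative weight works, and why degrading complicates it, here is a toy Python model. It is only a sketch: ToyCounts and its in-memory table are made up for illustration and assume the stored values are simple weighted word counts, which may not match how the classes store their data.

```python
from collections import defaultdict

class ToyCounts:
    """Hypothetical in-memory stand-in for the word frequency table."""

    def __init__(self):
        # (classification, word) -> weighted count
        self.counts = defaultdict(float)

    def teach(self, words, classification, weight=1.0):
        for word in words:
            self.counts[(classification, word)] += weight

    def degrade(self, factor):
        for key in list(self.counts):
            self.counts[key] *= factor

words = ["cheap", "pills"]

# Case 1: forget straight away using the same weight, negated.
toy = ToyCounts()
toy.teach(words, 1, 0.1)
toy.teach(words, 1, -0.1)
print(toy.counts[(1, "cheap")])  # 0.0

# Case 2: one degrade of 0.9 has happened since the training, so only
# 0.1 * 0.9 of it is left and that is the amount to take away.
aged = ToyCounts()
aged.teach(words, 1, 0.1)
aged.degrade(0.9)
aged.teach(words, 1, -0.1 * 0.9)
print(aged.counts[(1, "cheap")])
```

In the second case, using the original -0.1 would over-correct and leave a negative count behind, which is the reason for the warning about compensating for degrades.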

Classifying text

This time we use the classifier::classify() method, which takes up to two arguments:

  • classifications (required) - array of the different classifications (integers) you want to classify the text under
  • useorder (optional) - defaults to 0 and gives the proportion of the word order classification data you want to use. 0 = only word frequency, 1 = only word order

A typical call in Perl could be:

my @scores = $classifier->classify ( [ 1, 2, 5, 8 ], 0.3 );

or in PHP:

$scores = $classifier->classify ( array(1,2,5,8), 0.3 );

These will classify the text under classes 1, 2, 5 and 8 with a proportion of 0.3 (30%) of the word ordering classification and the remaining 0.7 (70%) of the word frequency classification.
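One plausible reading of the useorder proportion is a straight linear mix of two sets of per-class scores. The Python sketch below shows that reading only; the classes may well combine the two differently, and blend_scores and the sample scores are invented for the example.

```python
def blend_scores(freq_scores, order_scores, useorder=0.0):
    """Mix word-frequency and word-order scores per class.

    useorder=0 uses only word frequency, useorder=1 only word order.
    Assumes both inputs are dicts keyed by the same classifications.
    """
    mixed = {
        c: (1.0 - useorder) * freq_scores[c] + useorder * order_scores[c]
        for c in freq_scores
    }
    total = sum(mixed.values())
    return {c: score / total for c, score in mixed.items()}

freq = {1: 0.8, 2: 0.2}    # made-up word frequency scores
order = {1: 0.6, 2: 0.4}   # made-up word order scores
print(blend_scores(freq, order, 0.3))  # class 1 about 0.74, class 2 about 0.26
```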

One thing to take note of here is that in Perl you have to pass an array reference. Perl flattens the arguments of a function call into a single list, so the function can't tell where an array among the arguments ends and the other arguments begin. Because of this the most practical way of passing arrays is as a reference.

About bias

When classifying, the frequency of the different classes is taken into account, ie. if we get almost all SPAM then the classifier is more likely to classify text as SPAM, which has implications when it comes to false positives. To get it to treat all classes with even (unbiased) odds, set the "unbiased" flag. For Perl:

 $classifier->{'unbiased'} = 1;

or in PHP:

 $classifier->unbiased = true;
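The effect of the flag is easiest to see with a word that is equally common in both classes, so that only the class frequencies can tip the result. The following is a generic naive Bayes sketch in Python with add-one smoothing, not the code from the classes; the function posteriors, its arguments and the training numbers are all assumptions for illustration.

```python
import math

def posteriors(words, word_counts, class_totals, unbiased=False):
    """Naive Bayes class probabilities for a list of words.

    word_counts:  {classification: {word: count}}
    class_totals: {classification: number of texts trained}
    unbiased:     if True, give every class the same prior odds
    """
    vocab = {w for counts in word_counts.values() for w in counts}
    n_texts = sum(class_totals.values())
    log_scores = {}
    for c, counts in word_counts.items():
        prior = 1.0 / len(word_counts) if unbiased else class_totals[c] / n_texts
        total_words = sum(counts.values())
        log_p = math.log(prior)
        for w in words:
            # Add-one smoothing so unseen words don't zero the score.
            log_p += math.log((counts.get(w, 0) + 1) / (total_words + len(vocab)))
        log_scores[c] = log_p
    # Normalise in a numerically safe way.
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp.values())
    return {c: v / norm for c, v in exp.items()}

# Made-up training data: class 1 (SPAM) seen 90 times, class 2 (HAM) 10 times,
# with identical word counts so the words themselves carry no information.
word_counts = {1: {"offer": 5, "hello": 5}, 2: {"offer": 5, "hello": 5}}
class_totals = {1: 90, 2: 10}

biased = posteriors(["hello"], word_counts, class_totals)
even = posteriors(["hello"], word_counts, class_totals, unbiased=True)
print(biased)  # class 1 about 0.9, class 2 about 0.1
print(even)    # both classes about 0.5
```

With the class frequencies taken into account, an uninformative message still leans heavily towards SPAM; with even odds it comes out undecided, which is the safer behaviour where false positives matter.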

Other methods

There are two more methods which can be useful. First up, degrading existing data so that newer training takes precedence. This could be done on a CRON job or via a maintenance URL that is periodically pinged on a site. It's the same for Perl and PHP:

$classifier->degrade ( 0.9 );

That weights the existing data by 0.9 each time it's called, which gives training a half-life of roughly 1 week (about 6.6 days) if it's called daily. If you want to get pedantic about exactly 7 days then you could use 0.905723664264...  :-)
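The arithmetic behind those numbers is short enough to check in Python. The helper names degrade_factor and half_life are made up for this sketch; the only assumption is that each call multiplies the stored data by the same factor.

```python
import math

def degrade_factor(half_life_calls):
    """Factor to pass to degrade() so data halves after this many calls."""
    return 0.5 ** (1.0 / half_life_calls)

def half_life(factor):
    """Number of degrade() calls for data to halve at a given factor."""
    return math.log(0.5) / math.log(factor)

print(degrade_factor(7))  # about 0.9057, the pedantic value for 7 daily calls
print(half_life(0.9))     # about 6.58 calls, so 0.9 is close to a week
```

The same helpers work for other schedules, for example degrade_factor(24 * 7) if the job runs hourly and a one week half-life is still wanted.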

The other method is the word Quality score update. This always starts from the word whose Quality score was updated longest ago. It has one optional argument, the number of words to process; without it, all words are processed. Again it's the same in Perl as PHP:

$classifier->updatequality ( 200 );

That will update 200 words each time it's run. This is ideal for slow background updating. The word quality is not used in the classifier currently, but it may be of interest for other types of classification (eg. SEO or tagging). It may also be the kind of thing curious geeks like to know  :-)

Downloads

I've bundled everything into one tarball to make it easier. There are also Perl and PHP command-line examples that use SQLite, and SQL files for both MySQL and SQLite3 to create all the database tables needed.

Download: Perl, PHP and Python Bayesian Classifiers Class & CLI tools are on GitHub

How well does it work?

Initial indications are positive. In fact I haven't seen it fail, or even give an indecisive score, on the several SPAM comments each day on this blog, but it's always worth doing further tests.

I've created two command-line wrappers to the class using SQLite3 databases that take their input from STDIN, which means you can pipe curl / wget or just plain files into them to test. For comparison I've also used another open-source tool, dbacl.

I tested by piping w3m into the tool and trained with 5 pages from two popular news sites, then used 2 different pages on the same sites for comparison to see how it would detect the different sites:

$ w3m -dump -cols 160 -F  <SITE1 page6> | ./classifier classify 1 2
class1: 1
class2: 0
$ w3m -dump -cols 160 -F  <SITE1 page7> | ./classifier classify 1 2
class1: 1
class2: 0
$ w3m -dump -cols 160 -F  <SITE2 page6> | ./classifier classify 1 2
class1: 0
class2: 1
$ w3m -dump -cols 160 -F  <SITE2 page7> | ./classifier classify 1 2
class1: 8.83618605751473e-136
class2: 1

It's plenty good enough for everything I'm doing and with some embellishment may even do the job for a complex classification task I'm working on.

More on word order

This is mainly an experimental approach. Normally just word frequency is used for Bayesian classification, but I suspect that the order that words appear may also hold further clues as to the intent behind the text.

The disadvantage of taking pairs of words into account is that the number of combinations increases massively. Assuming random input it would be an O(n^2) problem, and that does not scale. Fortunately it's probably rather better than that, as language isn't random.

I'm not sure I have enough data yet, but my initial tests are showing that for about 4000 unique words the Frequency table has about 4000 rows for two classifications and the OrderFrequency table has about 10000 rows. That's a non-spammy classification, so it has well-behaved (non-random) text, and the OrderFrequency is way below O(n^2).

Another has about 2800 unique words with the Frequency table having about 3200 rows and the OrderFrequency table about 9500 rows, again for two classifications. That one is used for SPAM detection, so there is a lot more poor language and general garbage in it. You can see the effect in OrderFrequency being proportionally larger for the number of words, but it's still way below O(n^2).

With a well indexed database that's not a significant amount of storage or a performance problem. In fact as only a small subset of the words take part in classification each time, it's really quite manageable for most applications.
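A quick way to get a feel for how the number of word pairs compares with the worst case is to count them for a piece of text. This Python sketch assumes that "word order" means adjacent pairs of words and splits on whitespace only; count_rows is a made-up helper and the real tables may be built differently.

```python
def count_rows(text):
    """Return (unique words, unique adjacent word pairs) for a text."""
    words = text.lower().split()
    unique_words = set(words)
    unique_pairs = set(zip(words, words[1:]))
    return len(unique_words), len(unique_pairs)

n_words, n_pairs = count_rows("the cat sat on the mat and the cat ran")
print(n_words, n_pairs, n_words ** 2)  # 7 8 49
```

Even in this tiny sample the 8 pairs actually seen are far short of the 49 possible combinations, and a text can never contain more adjacent pairs than it has words.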

What's the most useless word?

For SPAM detection on this blog it's currently "very" with a quality score of 0.004, with "just" coming up behind with a score of 0.007. Interestingly, these words are frequently used for controlling emphasis and sit at the two extremes: "very" => high emphasis, "just" => low emphasis.

At the other end of the spectrum the most useful word for SPAM detection on this blog is currently "href".... funny that - all those attempts at linking URLs in SPAM are a bit of a give-away!
