Glen Pitt-Pladdy :: Blog

Blog Navigation

One consequence of developing my own blog platform is that the platform grows around my blogging habits. One of the things I have been trying to achieve is meaningful cross-linking between articles, and I finally got round to sorting this out.

The series

Possibly because of my in-depth knowledge and experience in certain fields, I have found myself blogging in series of articles on one subject. This was highlighted again as I have been thinking about producing two new series of articles: one on security strategies, and one about me, since I noticed plenty of searches for my name from sources all around the world (presumably people I have met in my travels, friends from school, uni, past jobs etc.). There is obviously an audience, and maybe it will turn up some interesting stories and I will catch up with some old friends.

A series will only work well if all the articles link to each other. Up until now I have been doing this manually, but that is a pain. What I want is an automatic way for the blog to organise itself.

I came up with a few options.....

Content based classification

There are various algorithms for this type of thing. Probably one of the best (and best known) is Bayesian inference, which is the basis of most modern "learning" spam filters. It can also be very effective for classification and automatic sorting of mail. There are even more exotic proprietary algorithms used by search engines.

The thing I don't like about this is that it makes things more complicated (the whole point of this blog platform is that it is ultra simple), and I would almost certainly have to resort to a database of some form.

Manual classification

This would be relatively straightforward - just have some classification tag on each article and then write some code to collect the articles together.

There are a number of possible flaws:

  • A typo in the classification (unless it is done via selection) would leave an article orphaned from its group.
  • There would be no indication of (or prioritisation by) the quality of the match between articles.
  • It would require me to make sure I put new articles into the right classification, which may be difficult if I want to write a new article in an old series, or just want to match up similar non-series articles.

Keyword based solution

Previously I abandoned the idea of tagging (keywords added by visitors to the site). While I think the idea of tagging is sound, it does not work for sites without a community of regular visitors (like this one), nor does it work effectively without large volumes of visitors: the duff tags need to be overwhelmed by good tags. The other problem with tags is that implementing an abuse-resistant tagging mechanism was becoming increasingly complicated, and that is against the fundamental aim of this platform. It had to go.

The "meta keywords" field in web pages has little real value and has long been abused - just look at the number of competitor trademarks, dodgy keywords, and completely unrelated keywords that web sites are peppered with. Search engines just ignore it because on many sites it is so misleading that it would undermine the quality of results offered to their users.

That is not to say it has no merit. What "meta keywords" gives you, when used correctly (i.e. good quality, relevant keywords), is a consistent tagging system (the author decides) that works for low volume sites where most visitors come from searching for a specific piece of information on a search engine rather than from a community built around the site. There is no need for any analysis algorithm - simply measure the level of correlation between the keywords of each pair of pages and we have a score.
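To make the correlation idea concrete, here is a minimal sketch in Python of scoring one page against another by keyword overlap. The function name and the normalisation (fraction of this page's keywords that the other page shares) are my illustrative assumptions, not the blog platform's actual code.

```python
# Illustrative sketch: score keyword overlap between two pages.
# keyword_score() and its normalisation are assumptions for this example.
def keyword_score(page_keywords, other_keywords):
    """Return the fraction of this page's keywords that the other page shares."""
    own = {k.strip().lower() for k in page_keywords}
    other = {k.strip().lower() for k in other_keywords}
    if not own:
        return 0.0
    return len(own & other) / len(own)
```

With three keywords on a page and two shared with another article, the score would be 2/3 - comfortably above a 10% threshold.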

The solution

The "meta keywords" based solution is what I have gone for. Parsing all the pages for keywords on every page load would be a huge performance hit, so I have a dummy maintenance page which gets called periodically via wget in a cron job. This runs through and maintains a list of all the articles associated with each keyword, and stores them as simple flat text files.
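The maintenance step could look something like the following sketch: take each article's keywords, invert the mapping, and write one flat text file per keyword listing the articles that use it. The function and the `keyword-index` directory name are made up for illustration; the real platform's file layout may differ.

```python
# Hypothetical sketch of the periodic index-building job.
# build_keyword_index() and the directory layout are illustrative assumptions.
import os

def build_keyword_index(article_keywords, index_dir="keyword-index"):
    """article_keywords maps article name -> list of its meta keywords.

    Writes one flat text file per keyword, one article name per line,
    and returns the in-memory index for convenience."""
    os.makedirs(index_dir, exist_ok=True)
    index = {}
    for article, keywords in article_keywords.items():
        for kw in keywords:
            index.setdefault(kw.strip().lower(), []).append(article)
    for kw, articles in index.items():
        with open(os.path.join(index_dir, kw + ".txt"), "w") as f:
            f.write("\n".join(sorted(articles)) + "\n")
    return index
```

A cron job fetching a maintenance URL (or running a script like this directly) keeps the flat files fresh without any per-page-load cost.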

When a page loads, it simply looks up the articles that match its own keywords and adds up how many times each article appears. From there we have a correlation score. I also had to set a threshold so that more common words didn't cause irrelevant articles to show up, but the thing that surprised me was how low this threshold could be while still getting really good matches.

Currently I have it set to show articles with a 10% or better correlation. I think the reason why this is so effective is because of the consistency and quality of the keywords, and it makes me wonder how good search engines would actually be if it weren't for all the people trying to game the system.

That's all for now, and hopefully you have a relevant "Similar articles" section up on the left when you visit this page.
