Glen Pitt-Pladdy :: Blog
I guess what developing my own blog platform means is that the platform will grow around my blogging habits. One of the things I have been trying to achieve is meaningful cross-linking between articles. I finally got round to sorting this.

The series

Possibly because of my in-depth knowledge and experience in certain fields, I have found myself blogging in series of articles on one subject. This was highlighted again as I have been thinking about producing two new series of articles: one on security strategies and, because I noticed plenty of searches for my name from sources all round the world (presumably people I have met in my travels, friends from school, uni, past jobs etc.), one about me. There's obviously an audience, and maybe it will turn up some interesting stories and I will catch up with some old friends.

A series is only going to work well if all the articles link to each other. Up until now I have been doing this manually, but that is a pain. What I want is an automatic way for the blog to just organise itself. I came up with a few options...

Content based classification

There are various algorithms for this type of thing. Probably one of the best (and best known) is Bayesian inference, which is the basis for most modern "learning" spam filters. It can also be very effective for classification and automatic sorting of mail. There are even more exotic proprietary algorithms used by search engines.

The thing I don't like about this is that I start having to make things more complicated (the whole point of this blog platform is that it is ultra simple), and would almost certainly have to resort to a database of some form.

Manual classification

This would be relatively straightforward - just have some classification tag on each article and then write some code to collect the articles together. There are a number of possible flaws:
Keyword based solution

Previously I abandoned the idea of tagging (keywords added by visitors to the site). While I think the idea of tagging is sound, it does not work with sites that do not have a community of regular visitors (like this one), nor does it work effectively without large volumes of visitors (the duff tags need to be overwhelmed by good tags). The other problem with tags is that implementing an abuse-resistant tagging mechanism was becoming increasingly complicated, and that is against the fundamental aim of this platform, which is simplicity. It had to go.

The "meta keywords" field in web pages has little real value and has long been abused - just look at the number of competitor trademarks, dodgy keywords, and completely unrelated keywords that web sites are peppered with. Search engines just ignore it because on many sites it is so misleading that it undermines offering quality results to their users.

That is not to say it has no merit. What "meta keywords" gives you when used correctly (i.e. good quality, relevant keywords) is a consistent tagging system (the author decides) that works for low volume sites where most visitors arrive from a search engine looking for a specific piece of information, rather than from a community built around the site. There is no need for any analysis algorithm - simply measure how much the keywords of two pages overlap and we have a score.

The solution

The "meta keywords" based solution is what I have gone for. Parsing all the pages for keywords on every page load would be a huge performance hit, so I have a dummy maintenance page which gets called periodically via wget in a cron job. This runs through and maintains a list of all the articles associated with each keyword, and stores these lists as simple flat text files. When a page loads, it simply looks up the articles that match its own keywords and adds up how many times each article appears. From there we have a score for correlation.
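For anyone who wants to try the idea, here is a minimal sketch of that flow in Python. It is not the platform's actual code: the function names, the article names, the keywords and the one-file-per-keyword layout are all assumptions made up for illustration. `build_index` stands in for the periodic maintenance run, and `count_matches` stands in for the lookup done when a page loads.

```python
import tempfile
from collections import Counter
from pathlib import Path


def build_index(articles, index_dir):
    """Maintenance step: write one flat text file per keyword,
    listing every article that carries that keyword."""
    index_dir = Path(index_dir)
    index_dir.mkdir(parents=True, exist_ok=True)
    by_keyword = {}
    for article, keywords in articles.items():
        for keyword in keywords:
            by_keyword.setdefault(keyword, []).append(article)
    for keyword, names in by_keyword.items():
        # Assumption: keywords are safe to use directly as file names.
        (index_dir / f"{keyword}.txt").write_text("\n".join(sorted(names)) + "\n")


def count_matches(article, keywords, index_dir):
    """Page-load step: look up each of this page's keywords and count
    how many times every other article turns up."""
    counts = Counter()
    for keyword in keywords:
        path = Path(index_dir) / f"{keyword}.txt"
        if not path.exists():
            continue
        for other in path.read_text().splitlines():
            if other != article:  # a page should not recommend itself
                counts[other] += 1
    return counts


# Hypothetical articles and their meta keywords, purely for illustration.
ARTICLES = {
    "spam-filtering": ["mail", "spam", "bayesian", "security"],
    "mail-server": ["mail", "postfix", "spam", "linux"],
    "cron-tips": ["linux", "cron"],
}

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as index_dir:
        build_index(ARTICLES, index_dir)
        print(count_matches("spam-filtering", ARTICLES["spam-filtering"], index_dir))
```

The point of the split is that the expensive part (reading every article's keywords) happens once per maintenance run, while a page load only reads a handful of small files.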
I also had to set a threshold so that more common words didn't cause irrelevant articles to show up, but the thing that surprised me was how low this threshold could be to get really good matches. Currently I have it set to show articles with a 10% or better correlation. I think the reason why this is so effective is because of the consistency and quality of the keywords, and it makes me wonder how good search engines would actually be if it weren't for all the people trying to game the system.

That's all for now, and hopefully you have a relevant "Similar articles" section up on the left when you visit this page.
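As a rough illustration of the threshold, here is a small Python sketch. The post does not say exactly how the correlation percentage is calculated, so this assumes it is the number of shared keywords divided by the number of keywords on the current page; the function name and the sample counts are hypothetical.

```python
def similar_articles(match_counts, keyword_total, threshold=0.10):
    """Return the names of articles whose correlation meets the threshold,
    best match first.

    Assumption: correlation = shared keywords / keywords on the current page.
    """
    if keyword_total == 0:
        return []
    scored = [(name, count / keyword_total) for name, count in match_counts.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    # Highest correlation first; ties are broken alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in kept]


# Hypothetical counts for a page that has 20 keywords of its own.
counts = {"mail-server": 3, "cron-tips": 1, "spam-filtering": 2}
print(similar_articles(counts, keyword_total=20))
```

With these made-up numbers, an article sharing only 1 keyword out of 20 (5%) falls below the 10% cutoff and is left out of the list.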
Disclaimer: This is a load of random thoughts, ideas and other nonsense and is not intended to be taken seriously. I have no idea what I am doing with most of this so if you are stupid and naive enough to believe any of it, it is your own fault and you can live with the consequences. More importantly this blog may contain substances such as humor which have not yet been approved for human (or machine) consumption and could seriously damage your health if taken seriously. If you still feel the need to litigate (or whatever other legal nonsense people have dreamed up now), then please address all complaints and other stupidity to yourself as you clearly "don't get it".
Copyright Glen Pitt-Pladdy 2008-2023