
Glen Pitt-Pladdy :: Blog

The problems with tags

The only chance I get to add new features seems to be on weekends, so here are today's, not that I am actually doing much.

Apart from 3 nights out in a row, I have a new gadget (see next blog) which is causing me considerable distraction as well as indigestion.

On hold: tags

My intention with tags is that they become the primary way of finding stuff on the site. Because they are intelligently created (by a human - OK, maybe not that intelligent then!), they should be the best way of indexing by content. This could also help search engine bots make the most of pages.

My difficulty is finding a tidy way to maintain the tag indexes. Again, a database would be a good cheat for this: simply rely on the database to work it out. Right now I am undecided on how best to do this, hence this feature is on hold.

I am trying to resist a database, so the best way to implement this will come down to a guess at the load of maintaining the index vs. the load of reading it. The simplest thing I can think of is a bunch of files, one per tag, each containing a list of articles, the number of times the tag occurs in each article, and the proportion of the tag for that article. On a read (I expect >90% of operations), all that needs to be done is to read the file for each tag of interest, so reads are kept to a minimum. However, if I wanted to be able to browse the tags, then each step of browsing would require reading all the tag files: not good for efficiency. On writes, this approach would be good, since only the files for the tags we are interested in would be written.
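A minimal sketch of the one-file-per-tag idea, in Python purely for illustration (the blog's real implementation language and file format are not stated here). The names `write_article_tags` and `articles_for_tags`, the JSON layout, and the reading of "proportion" as this tag's count divided by the total tag count for the article are all assumptions. It also assumes tags are safe to use as file names and that counts are positive.

```python
import json
from pathlib import Path


def write_article_tags(index_dir: Path, article_id: str, tag_counts: dict) -> None:
    """Record one article in the file of every tag it carries.

    Only the files for this article's tags are touched, which is what
    keeps writes cheap in this layout.
    """
    index_dir.mkdir(parents=True, exist_ok=True)
    total = sum(tag_counts.values())
    for tag, count in tag_counts.items():
        path = index_dir / f"{tag}.json"
        entries = json.loads(path.read_text()) if path.exists() else {}
        # Assumed meaning of "proportion": share of all tag occurrences in the article.
        entries[article_id] = {"count": count, "proportion": count / total}
        path.write_text(json.dumps(entries, indent=2))


def articles_for_tags(index_dir: Path, tags: list) -> set:
    """Return the articles carrying all the given tags: one file read per tag."""
    result = None
    for tag in tags:
        path = index_dir / f"{tag}.json"
        entries = json.loads(path.read_text()) if path.exists() else {}
        ids = set(entries)
        result = ids if result is None else result & ids
    return result or set()
```

Looking up a known tag costs a single file read, but nothing in a tag's file says which other tags exist, so a tag browser built on this layout would have to open every file at each step.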

On the other hand, if I built a browsable tree of all keyword combinations, then maintaining the index would be a pain. Any tag added would result in a tag tree for every tag on the article: massive overhead. I also want the system to be self-healing, and this system certainly won't be without a lot of work.
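To get a rough feel for that overhead, here is a small Python sketch. It assumes the tree would hold one node per non-empty combination of an article's tags, which is only one possible reading of the tree described above; `tag_combinations` is a hypothetical helper.

```python
from itertools import combinations


def tag_combinations(tags):
    """Yield every non-empty combination of the given tags, in sorted order."""
    ordered = sorted(tags)
    for size in range(1, len(ordered) + 1):
        yield from combinations(ordered, size)


# Count the index nodes a single article would touch for n tags.
for n in (3, 5, 10):
    tags = [f"tag{i}" for i in range(n)]
    print(n, sum(1 for _ in tag_combinations(tags)))
# prints:
# 3 7
# 5 31
# 10 1023
```

Under that assumption the count is 2^n - 1, so each extra tag on an article roughly doubles the number of index entries to maintain.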

The best compromise I can think of is to add the other tags to the index for each tag. This gives the tag browser the hints it needs to be efficient (only read the known tags associated with the tags chosen so far), and as tags are added, the associated tags get updated, so it is self-healing. Where it doesn't self-heal is with tag removal (e.g. spam), but this can easily be handled with a maintenance task that gradually updates the tag index, oldest first: one per run, say each hour. That should be sufficient.
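The compromise can be sketched as follows, again in Python for illustration only and held in memory rather than in files to keep it short. The function names, the `checked` timestamp used to pick the oldest entry, and the `article_tags` mapping (standing in for whatever the articles themselves say their tags are) are all assumptions, not the actual implementation.

```python
import time


def add_article(index, article_id, tags, now=None):
    """Add an article to each of its tags, recording the co-occurring tags."""
    now = time.time() if now is None else now
    tags = set(tags)
    for tag in tags:
        entry = index.setdefault(
            tag, {"articles": set(), "related": set(), "checked": now}
        )
        entry["articles"].add(article_id)
        entry["related"] |= tags - {tag}  # self-healing on additions


def narrow(index, chosen):
    """Suggest tags to offer next, reading only the entries for the chosen tags."""
    chosen = set(chosen)
    if not chosen:
        return set(index)
    candidates = None
    for tag in chosen:
        related = index.get(tag, {}).get("related", set())
        candidates = set(related) if candidates is None else candidates & related
    return candidates - chosen


def maintain_oldest(index, article_tags, now=None):
    """Rebuild the least recently checked tag from the articles' own tag lists."""
    if not index:
        return None
    now = time.time() if now is None else now
    tag = min(index, key=lambda t: index[t]["checked"])
    articles = {a for a, tags in article_tags.items() if tag in tags}
    if not articles:
        del index[tag]  # tag no longer used anywhere, e.g. removed spam
        return tag
    related = set()
    for article in articles:
        related |= set(article_tags[article]) - {tag}
    index[tag] = {"articles": articles, "related": related, "checked": now}
    return tag
```

Calling `maintain_oldest` once per scheduled run (hourly, say) would gradually flush out stale associations left behind by removed tags. Note that `narrow` only returns hints: a suggested tag is associated with each chosen tag individually, not necessarily with all of them on the same article.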

Solved: tags

I guess this is one of the neat things with blogging - I just solved the problem by writing about it. I have quickly implemented the read handling code, and now need to find some time to put in the index creation (write) and maintenance code.

That will have to wait since I have plenty to be getting on with, but it goes into the current...

TODO

  • Tidy up Atom feed code
  • Tidy up CSS
  • Finish off tagging
    • Tag index writing
    • Tag maintenance
    • Browsable tag index
    • Related Blogs via tags
  • Discussions
  • Statistics
