Glen Pitt-Pladdy :: Blog

The problems with tags

The only chance I get to add new features seems to be on weekends, so here are today's, not that I am actually doing much.

Apart from 3 nights out in a row, I have a new gadget (see next blog) which is causing me considerable distraction as well as indigestion.

On hold: tags

My intention with tags is that they become the primary way of finding stuff on the site. Because they are created intelligently (by a human - ok, maybe not that intelligent then!), they should be the best way of indexing by content. They could also help search engine bots make the most of pages.

My difficulty is finding a tidy way to maintain the tag indexes. Again, a database would be a good cheat for this: simply rely on the database to work it out. Right now I am undecided on how best to do this, hence this feature is on hold.

I am trying to resist a database, so the best way to implement this will come down to taking a guess at the load of maintaining the index vs. the load of reading the index. The simplest thing I can think of is a bunch of files, one per tag, each containing a list of articles, the number of times the tag occurs in each article, and the proportion of the tag for that article. On a read (I expect >90% of operations), all that needs to be done is to read the file for each tag requested, which keeps reads to a minimum. But if I wanted to be able to browse the tags, then each step of browsing would require reading all the tag files - not good for efficiency. On writes, this approach would be good since only the tag files we are interested in would be written.
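A minimal sketch of the one-file-per-tag idea, assuming JSON files named after the tag and assuming "proportion" means the tag's count divided by the total tag occurrences in the article. The function names, the file layout and that definition of proportion are illustrative guesses, not the code running this blog.

```python
import json
from pathlib import Path


def write_tag_entry(index_dir, tag, article, count, total_tags):
    """Record one article in the file for `tag`; no other tag file is touched.

    Assumes `tag` is safe to use as a file name (hypothetical simplification).
    """
    path = Path(index_dir) / f"{tag}.json"
    entries = json.loads(path.read_text()) if path.exists() else {}
    entries[article] = {
        "count": count,                      # times the tag occurs in the article
        "proportion": count / total_tags,    # assumed meaning of "proportion"
    }
    path.write_text(json.dumps(entries, indent=2))


def read_tag(index_dir, tag):
    """Read the single file for `tag`; an unknown tag gives an empty index."""
    path = Path(index_dir) / f"{tag}.json"
    return json.loads(path.read_text()) if path.exists() else {}
```

A lookup for one tag costs one file read, and a write touches only the files for the tags on the article being saved. Listing which tags exist at all would still mean scanning the whole directory, which is the browsing cost described above.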

On the other hand, if I built a browsable tree of all keyword combinations, then maintaining the index would be a pain. Any tag added would result in a tag tree for every tag on the article: massive overhead. I also want the system to be self-healing, and this system certainly won't be without a lot of work.
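To get a feel for that overhead, here is a rough count of the index nodes one article would touch, assuming the tree holds one node per unordered combination of the article's tags. That reading of "all keyword combinations" is an assumption, and `tag_paths` is a made-up helper name.

```python
from itertools import combinations


def tag_paths(tags):
    """Every non-empty combination of an article's tags, ignoring order."""
    tags = sorted(tags)
    return [
        combo
        for size in range(1, len(tags) + 1)
        for combo in combinations(tags, size)
    ]


if __name__ == "__main__":
    for n in range(1, 8):
        sample = [f"tag{i}" for i in range(n)]
        print(n, "tags ->", len(tag_paths(sample)), "index nodes")
```

Under this assumption an article with n tags touches 2^n - 1 nodes, so each extra tag roughly doubles the write work for that article.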

The best compromise I can think of is to add the other tags to the index for each tag. This gives the tag browser the hints it needs to be efficient (only read the known tags associated with the tags selected so far), and as tags are added, the associated tags get updated, so it is self-healing. Where it doesn't self-heal is with tag removal (e.g. spam), but this can easily be handled with a maintenance task that gradually updates the tag index, oldest first - one tag per run, say each hour. That should be sufficient.
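The compromise could be sketched roughly as follows, using an in-memory dict in place of the per-tag files. The entry layout, the `checked` timestamp and all function names are assumptions made for illustration, and `source` stands in for whatever holds the real article-to-tags mapping.

```python
def new_entry(now):
    """Empty index entry for one tag; `checked` is when it was last rebuilt."""
    return {"articles": set(), "related": set(), "checked": now}


def add_article(index, article, tags, now):
    """Add `article` under each of its tags and note the tags it shares."""
    tags = set(tags)
    for tag in tags:
        entry = index.setdefault(tag, new_entry(now))
        entry["articles"].add(article)
        entry["related"] |= tags - {tag}   # self-heals as tags are added


def browse(index, selected):
    """Return (articles carrying all selected tags, hint tags to offer next)."""
    entries = [index[t] for t in selected if t in index]
    if not entries or len(entries) != len(selected):
        return set(), set()
    articles = set.intersection(*(e["articles"] for e in entries))
    hints = set.intersection(*(e["related"] for e in entries)) - set(selected)
    return articles, hints


def refresh_oldest(index, source, now):
    """Maintenance run: rebuild the least recently checked tag from `source`.

    `source` maps article -> tags and is treated as the truth, so tags that
    were removed (e.g. spam) drop out of the index one run at a time.
    """
    if not index:
        return None
    tag = min(index, key=lambda t: index[t]["checked"])
    entry = new_entry(now)
    for article, tags in source.items():
        if tag in tags:
            entry["articles"].add(article)
            entry["related"] |= set(tags) - {tag}
    if entry["articles"]:
        index[tag] = entry
    else:
        del index[tag]                     # nothing uses this tag any more
    return tag
```

Browsing only reads the entries for the tags already selected, and the hints are just candidates: a hinted tag may still give no articles once combined with the rest. Calling `refresh_oldest` once per scheduled run works through the whole index over time without ever rebuilding everything at once.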

Solved: tags

I guess this is one of the neat things about blogging - I just solved the problem by writing about it. I have quickly implemented the read handling code, and now need to find some time to put in the index creation (write) and maintenance code.

That will have to wait since I have plenty to be getting on with, but goes into the current....

TODO

  • Tidy up Atom feed code
  • Tidy up CSS
  • Finish off tagging
    • Tag index writing
    • Tag maintenance
    • Browsable tag index
    • Related Blogs via tags
  • Discussions
  • Statistics