Glen Pitt-Pladdy :: Blog

The problems with tags
The only chance I get to add new features seems to be on weekends, so here are today's, not that I am actually doing much. Apart from 3 nights out in a row, I have a new gadget (see next blog) which is causing me considerable distraction as well as indigestion.

On hold: tags

My intention with tags is that they become the primary way of finding stuff on the site. Because they are intelligently created (by a human - ok, maybe not that intelligent then!), they should be the best way of indexing by content. This could also help search engine bots make the most of pages.

My difficulty is with finding a tidy way of maintaining the tag indexes. Again, a database would be a good cheat for this: simply rely on the database to work this out. Right now I am undecided on how best to do this, hence this feature is on hold. I am trying to resist a database, so the best way to implement this will come down to taking a guess at the load of maintaining the index vs. the load of reading the index.

The simplest thing I can think of is a bunch of files, one per tag, each containing a list of articles, the number of times the tag occurs in each article, and the proportion of the tag for that article. On a read (I expect >90% of operations), all that needs to be done is to read the file for each tag requested, so we would have a minimum of reads. But if I wanted to be able to browse the tags, then each step of browsing would require reading all the tag files - not good for efficiency. On writes, this approach would be good since only the tag files we are interested in would be written.

On the other hand, if I built a browsable tree of all keyword combinations, then maintaining the index would be a pain. Any tag added would result in a tag tree for every tag on the article: massive overhead. I also want the system to be self-healing, and a tree like this certainly won't be without a lot of work.

The best compromise I can think of is to add the other tags to the index for each tag. This gives the tag browser the hints to be efficient (only read the known tags associated with the tags selected so far), and as tags are added, the associated tags get updated, so it is self-healing. Where it doesn't self-heal is with tag removal (e.g. spam), but this can easily be handled with a maintenance task that gradually updates the tag index (oldest first) - one tag per run, say each hour. That should be sufficient.

Solved: tags

I guess this is one of the neat things with blogging - I just solved the problem by writing about it. I have quickly implemented the read handling code, and now need to find some time to put in the index creation (write) and maintenance code. That will have to wait since I have plenty to be getting on with, but goes into the current.... TODO
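To make the compromise above a bit more concrete, here is a minimal sketch in Python of how such an index might look. It is only an illustration under stated assumptions, not the code running on this site: the JSON file format, the field names and every function name (`index_article`, `browse`, `rebuild_tag`) are hypothetical, tag names are assumed to be safe to use as file names, and "proportion" is assumed to mean the tag's count divided by the total tag count for the article.

```python
import json
import tempfile
from pathlib import Path


def _tag_path(index_dir, tag):
    # One JSON file per tag (assumes tag names are filesystem-safe).
    return Path(index_dir) / f"{tag}.json"


def read_tag(index_dir, tag):
    """Read a single tag file; an unknown tag yields an empty entry."""
    path = _tag_path(index_dir, tag)
    if not path.exists():
        return {"articles": {}, "related": []}
    return json.loads(path.read_text())


def _write_tag(index_dir, tag, entry):
    _tag_path(index_dir, tag).write_text(json.dumps(entry, indent=1))


def _article_record(tag_counts, tag):
    # Assumed meaning of "proportion": this tag's share of all tag
    # occurrences in the article (counts are assumed to be positive).
    total = sum(tag_counts.values())
    return {"count": tag_counts[tag], "proportion": tag_counts[tag] / total}


def index_article(index_dir, article, tag_counts):
    """Write path: only the files for tags on this article are touched."""
    for tag in tag_counts:
        entry = read_tag(index_dir, tag)
        entry["articles"][article] = _article_record(tag_counts, tag)
        # Record the other tags on this article as hints for browsing.
        others = set(entry["related"]) | (set(tag_counts) - {tag})
        entry["related"] = sorted(others)
        _write_tag(index_dir, tag, entry)


def browse(index_dir, selected):
    """Read path: read only the files for the tags selected so far."""
    if not selected:
        return [], []
    entries = [read_tag(index_dir, tag) for tag in selected]
    articles = set.intersection(*(set(e["articles"]) for e in entries))
    # Hints are candidates only: a hinted tag is associated with every
    # selected tag, but not necessarily on the same article.
    hints = set.intersection(*(set(e["related"]) for e in entries))
    return sorted(articles), sorted(hints - set(selected))


def rebuild_tag(index_dir, tag, all_articles):
    """Maintenance: regenerate one tag file from the articles themselves.

    all_articles maps article -> {tag: count}; a stand-in for however
    the site stores its articles.
    """
    entry = {"articles": {}, "related": []}
    related = set()
    for article, tag_counts in all_articles.items():
        if tag in tag_counts:
            entry["articles"][article] = _article_record(tag_counts, tag)
            related |= set(tag_counts) - {tag}
    entry["related"] = sorted(related)
    _write_tag(index_dir, tag, entry)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        index_article(d, "post-1", {"tags": 3, "index": 1})
        index_article(d, "post-2", {"tags": 1, "database": 1})
        print(browse(d, ["tags"]))
        # prints (['post-1', 'post-2'], ['database', 'index'])
```

A maintenance run would pick the oldest tag file and call something like `rebuild_tag` on it, which is where removed tags (e.g. spam) would finally drop out of the index.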
This is a bunch of random thoughts, ideas and other nonsense, and is not intended to be taken seriously. I'm experimenting and mostly have no idea what I am doing with most of this, so it should be taken with caution and at your own risk. Intrusive technologies are minimised where possible. For the purposes of reducing abuse and other risks hCaptcha is used and has its own policies linked from the widget.
Copyright Glen Pitt-Pladdy 2008-2023