Glen Pitt-Pladdy :: Blog

State of the blog
It's now a week since I kicked off this project, and despite only spending an hour here and there on it, it has progressed nicely. This is what has happened since the last update.

To database, or not to database.... that is the question

The whole aim of this project from the start has been simplicity. My intention is a system that can be dropped in and just work. In keeping with this approach, I opted for a very basic back-end storage system - simply sling a file called article.html in a directory named with the date format yyyymmdd-hhmmss[-+]OOoo. This provides all the date info for the RFC 3339 dates used in RSS / Atom feeds and article lists. Then I devised some header rewriting for serving additional files.

All this makes running the blog very simple - just use an (x)html editor (currently I'm using kompozer), create the directory and save the file, and that's all - the article is up. No need for a transaction based system, or anything more complicated.

The indexing code in the back-end libraries picks the relevant info out of the pages on demand. I have even decided not to cache article data, since currently only 20 pages are loaded at most in any request, the pages are so small, and anything accessed regularly will be in the file system cache anyway.

Measuring the load time for the index page a number of times in succession comes out at 78ms, 20ms, 33ms, 66ms, 18ms, 22ms over the local network. To really test it, I scripted up 100 consecutive requests, which it did in 2.074s (about 21ms per request) - more than good enough for now. Performance will deteriorate with time as more articles are added, but for the traffic I am expecting, I don't see how this is likely to become a problem. Even the Slashdot Effect seems to peak at a few hundred requests a minute, or about 15 seconds per 100 requests. I expect my ADSL connection would be the bottleneck there rather than how quickly the page loads.

Where a database would definitely make more sense is when it comes to discussions, tags and searching.
These are all things where I can't control the timing of data being added - they happen when they do, and the transactional model of a database, as well as keeping everything together for searching, would certainly help.

Tempting as it may seem, I am going to try to avoid a database for now, if only in the name of simplicity of setup. I want this system to be able to work on minimalist hosting providers where all that is available is basic PHP. How? Well, file systems are in some ways adequate. Issues like simultaneous access to files can be addressed by locking, unique file names (e.g. combinations of inode, host, pid), temporary file names, and various other mechanisms. That said, a database would make things easier in the long run.

Features.....

There have been minor changes to the way files / images / other links are added in, and I have added some HTTP redirects to take care of this until people / bots have updated, which Googlebot seems to have done already last night.

The main blog index is now paginated at 20 articles, with a page selector that appears automatically when we hit that amount, and the number of articles per page is easily changed in the config. Likewise the navigation (recent) index is only 5 articles, again easily changed in the config. At the same time, I have updated the back-end code for listing the articles so that it only reads the articles it needs to (on-demand), which should improve performance as the blog grows.

The other thing is the beginnings of some basic tagging. This still needs some basic anti-abuse mechanisms adding, as well as caching and indexing of the articles by tag. I am still playing with this feature, so expect it to change and existing tags to be removed at some point.

One of the key things is how to handle abuse. The mechanism I am looking at right now is a one IP, one set of tags (per article) approach. Each IP will be able to add one set of space separated tags, with no punctuation (apart from apostrophes).
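As a rough sketch of that tag format, something like the following check might do. It is written in Python purely for illustration (the blog itself targets basic PHP hosting), and the function name `parse_tags`, the ASCII-only character set and the treatment of any run of whitespace as a separator are all my assumptions here rather than the actual implementation.

```python
import re

# Assumption: a tag consists only of ASCII letters, digits and apostrophes.
_TAG_RE = re.compile(r"[A-Za-z0-9']+")


def parse_tags(raw):
    """Return the list of tags in `raw`, or None if the input breaks the format.

    Hypothetical helper: any run of whitespace is treated as a separator,
    and a single bad tag rejects the whole submission.
    """
    tags = raw.split()
    if not tags:
        return None  # nothing submitted
    if not all(_TAG_RE.fullmatch(tag) for tag in tags):
        return None  # punctuation other than apostrophes was found
    return tags
```

A caller would simply drop the submission whenever `parse_tags` returns None, without reporting an error back to the client.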
Any tagging attempts that do not match this format are silently ignored. I may look at some form of regex based filters to nuke regular abuse. Also, I only display / search the most common tags, which will mean that, provided the legitimate / abuse ratio can be kept favourable, the system should be inherently immune to abuse.
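To make the one IP, one set of tags idea a bit more concrete, here is a minimal in-memory sketch, again in Python for illustration only. The `TagStore` class, its method names and the choice to let a later submission from the same IP replace the earlier one are all assumptions on my part; the real thing would need to persist to files and may well behave differently.

```python
from collections import Counter


class TagStore:
    """In-memory stand-in for per-article tag storage (hypothetical)."""

    def __init__(self):
        # article id -> {ip address -> set of tags submitted from that IP}
        self._by_article = {}

    def submit(self, article, ip, tags):
        """Record one set of tags per (article, ip).

        Assumption: a later submission from the same IP replaces the
        earlier one, so an IP never holds more than one set per article.
        """
        self._by_article.setdefault(article, {})[ip] = set(tags)

    def common_tags(self, article, limit=5):
        """Return up to `limit` tags, most widely submitted first.

        Each IP counts at most once per tag; ties are broken alphabetically.
        """
        counts = Counter()
        for tags in self._by_article.get(article, {}).values():
            counts.update(tags)
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        return [tag for tag, _ in ranked[:limit]]
```

Since each IP contributes at most one vote per tag, a single abusive address can only add tags with a count of one, and these drop off the displayed list as soon as enough legitimate tags outrank them.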
This is a bunch of random thoughts, ideas and other nonsense, and is not intended to be taken seriously. I'm experimenting and mostly have no idea what I am doing with most of this, so it should be taken with caution and at your own risk. Intrusive technologies are minimised where possible. For the purposes of reducing abuse and other risks hCaptcha is used and has its own policies linked from the widget.
Copyright Glen Pitt-Pladdy 2008-2023