Glen Pitt-Pladdy :: Blog
dirvish-checksum available again
An age-old problem with backups is how to validate them. Periodic restores are a good test, but it's also useful to have a quick check against corruption, and checksums are well suited to that. For smaller networks (home and smaller businesses) I have been happily using dirvish for some time. It is based on rsync (a well tested tool) and wraps it with a load of other useful features, such as the ability to execute pre and post scripts, which are very useful for controlling LVM snapshots and other backup related tasks.
The one thing that it doesn't appear to have is a means of generating checksums to check backups against corruption or damage, and for this I created dirvish-checksum several years ago. It creates MD5 and SHA1 checksums of backups, detects hard links, and reuses checksums from previous generations where hard links are in use, generally trying to be efficient while maintaining valid checksums.
Over the years I have adapted and customised the basic script for the specific needs of companies I have worked for, as well as friends who have used dirvish, while updating and extending the original script for my own needs. Most of the updates to my script have been performance related and include caching mechanisms to make the process far more efficient, plus improved handling of data within the script.
What I have only recently discovered is that the publicly available version from many years back was removed from this site when I created this blog, so now it returns. There are still plenty more refinements possible, and things in the code that I will clean up with time, so keep an eye on this page.
By default, like dirvish itself, dirvish-checksum will read /etc/dirvish/master.conf for its config and then read in the config specified by the --config=<configfile> option, if given.
Then it will run through all the vaults in all the banks specified in the config files and attempt to create MD5SUMS.bz2 and SHA1SUMS.bz2 for the backup. Optionally you can disable each of these with the --nomd5 and --nosha1 options.
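As a rough illustration of the options above, a small wrapper might look like the following. The function name, the DIRVISH_CHECKSUM override variable and the config path in the usage note are all hypothetical; only the --config and --nosha1 options come from the description above.

```shell
# Hypothetical wrapper: generate MD5SUMS.bz2 only, using an extra config file.
# DIRVISH_CHECKSUM is an assumed override so the sketch can be dry-run;
# by default it calls dirvish-checksum from the PATH.
checksum_md5_only() {
    # $1: extra config file passed via --config
    "${DIRVISH_CHECKSUM:-dirvish-checksum}" --config="$1" --nosha1
}
```

For example, checksum_md5_only /etc/dirvish/extra.conf (an illustrative path) would skip the SHA1 pass, and swapping --nosha1 for --nomd5 would do the opposite.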
An optimisation in more recent versions is to read index.gz to avoid having to stat every file to find its inode. This dramatically speeds up the process of determining whether the new generation of a file is hard linked to the previous one.
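To give a feel for the index lookup, here is a minimal sketch that pulls an inode number out of index.gz. It assumes the index is in "find -ls" style (inode in the first column, path from the eleventh column onwards) and is not taken from the actual script. The function name is made up for illustration, and awkward file names (backslashes, symlink targets, device files) are not handled.

```shell
# Sketch only: print the inode recorded in an index for one path.
# Assumes "find -ls" style lines: inode first, path from field 11 onwards.
inode_from_index() {
    # $1: path to index.gz, $2: path as it appears in the index (e.g. ./etc/hosts)
    gzip -dc "$1" | awk -v want="$2" '
        !found {
            path = $11
            # Re-join paths that contain spaces
            for (i = 12; i <= NF; i++) path = path " " $i
            if (path == want) { print $1; found = 1 }
        }'
}
```

If the same path has the same inode number in the index of two generations on the same filesystem, the files are hard linked and no stat call is needed to establish that.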
In cases where the new file is found to be hard linked to the previous generation, there is no point in checksumming it again as the data will be the same. The only situation where the data may have changed is if there has been some corruption, and that should be found when you do periodic checks on your backups anyway.
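The reuse logic can be sketched in a few lines of shell. This is an illustration of the idea rather than the code from the Perl script: the function name and arguments are invented here, and it relies on the -ef file test, which common shells (bash, dash, ksh) provide.

```shell
# Sketch only: print the MD5 of a file, reusing the checksum recorded for
# the previous generation when both paths are the same inode (hard linked).
md5_or_reuse() {
    new=$1        # file in the new generation
    old=$2        # same file in the previous generation
    old_sum=$3    # checksum already recorded for the previous generation
    if [ -n "$old_sum" ] && [ -e "$old" ] && [ "$new" -ef "$old" ]; then
        # Same inode, so same data: no need to read the file again
        printf '%s\n' "$old_sum"
    else
        # New or changed file: checksum it from scratch
        md5sum < "$new" | cut -d' ' -f1
    fi
}
```

The same pattern applies to SHA1 by swapping md5sum for sha1sum.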
In most cases you can run dirvish-checksum with no arguments and it will pick up your config and generate the checksums. The first run will be slow as it has to checksum every file in a vault, but subsequent generations will be far more efficient as it will only be dealing with changes.
Verifying a backup uses the standard -c option of md5sum or sha1sum, run from within the tree directory of the backup. The syntax is identical for both:
.../tree# bzcat ../MD5SUMS.bz2 | md5sum -c
It is a good idea to validate the checksums periodically. Because of the hard linking, if the data of a file becomes corrupted then multiple (perhaps all) generations of that file will be corrupted too, as they share the same data.
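For periodic checks, the verification can be wrapped up so that it covers both checksum files of one backup. This is a sketch under assumptions: the function name is invented, it expects to be given the directory that holds tree/ alongside MD5SUMS.bz2 and SHA1SUMS.bz2, and --quiet (which prints only failures) is a GNU coreutils option.

```shell
# Sketch only: verify one backup against whichever checksum files it has.
# $1 is the directory containing tree/, MD5SUMS.bz2 and SHA1SUMS.bz2.
verify_image() {
    image=$1
    status=0
    if [ -f "$image/MD5SUMS.bz2" ]; then
        # Paths in the sums file are relative to tree/, so check from there
        bzcat "$image/MD5SUMS.bz2" | (cd "$image/tree" && md5sum --quiet -c) || status=1
    fi
    if [ -f "$image/SHA1SUMS.bz2" ]; then
        bzcat "$image/SHA1SUMS.bz2" | (cd "$image/tree" && sha1sum --quiet -c) || status=1
    fi
    # Non-zero if any file failed to verify
    return "$status"
}
```

Calling this in a loop over the backups in a vault, from cron or a backup script, gives a simple pass or fail result per backup.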
Download / Install
Download: dirvish-checksum 20120421 Perl script
This will need making executable (chmod +x) and placing in a suitable place. Typically this would be somewhere like /usr/local/sbin.
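One way to do both steps at once is with install, which copies the file and sets its mode. The function name below is just for illustration, and the destination defaults to /usr/local/sbin as suggested above.

```shell
# Sketch only: copy the downloaded script into place and make it executable.
install_checksum_script() {
    src=$1                        # downloaded script
    dest=${2:-/usr/local/sbin}    # target directory (needs write permission)
    install -m 0755 "$src" "$dest/dirvish-checksum"
}
```

Run as root, install_checksum_script ./dirvish-checksum would place it in /usr/local/sbin.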
I run this as part of my backup script, which includes setting up ssh keys, checking availability of devices (e.g. laptops, phones) and ensuring that many causes of failure are avoided.
The script requires Perl, plus the md5sum, sha1sum, bzip2 and find commands. Beyond those it doesn't depend on anything that would not normally be available.
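A quick way to confirm those commands are present before the first run is sketched below. The function name is made up for illustration; command -v is the standard shell way to look a command up.

```shell
# Sketch only: print any of the named commands that cannot be found.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}
```

Running missing_commands perl md5sum sha1sum bzip2 find should print nothing on a system that is ready to go.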
Copyright Glen Pitt-Pladdy 2008-2013