
Glen Pitt-Pladdy :: Blog

dirvish-checksum available again

An age-old problem with backups is how to validate them. Periodic restores are a good test, but it's also useful to have a quick check against corruption, and checksums are very useful for that. For smaller networks (home and small businesses) I have been happily using dirvish for some time. It is based on rsync (a well-tested tool) and wraps it with a load of other useful features, such as being able to execute pre- and post-scripts, which are very useful for controlling LVM snapshots and other backup-related tasks.

The one thing that it doesn't appear to have is a means of generating checksums to validate backups against corruption or damage, and for this I created dirvish-checksum several years ago. It creates MD5 and SHA1 checksums of backups, detects hard links, and reuses checksums from previous generations where hard links are in use, generally trying to be efficient while maintaining valid checksums.

Over the years I have adapted and customised the basic script for the specific needs of companies I have worked for, as well as friends who have used dirvish, while updating and extending the original script for my own needs. Most of the updates to my script have been performance related, including caching mechanisms that make the process far more efficient, and improved handling of data within the script.

What I only recently discovered is that the publicly available version from many years back had been removed from this site when I created this blog, and now it returns. There are still plenty of refinements possible and things in the code that I will clean up with time, so keep an eye on this page.

Use

By default, like dirvish itself, dirvish-checksum will look at /etc/dirvish/master.conf for its configuration, and will then read in the config specified by the --config=<configfile> option.

It will then run through all the vaults in all the banks specified in the config files and attempt to create MD5SUMS.bz2 and SHA1SUMS.bz2 for each backup. Optionally, each of these can be disabled with the --nomd5 and --nosha1 options respectively.
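Roughly speaking, the content of those two files corresponds to the following plain commands (illustrative only, and far less efficient than the script since this re-reads every file; the image path is just an example):

```shell
# What MD5SUMS.bz2 and SHA1SUMS.bz2 contain, expressed as plain commands.
# The real script avoids re-reading hard-linked files; /backup/bank/vault
# and the image name are example paths, not anything the script mandates.
cd /backup/bank/vault/20170423-0300
( cd tree && find . -type f -print0 | xargs -0 md5sum )  | bzip2 > MD5SUMS.bz2
( cd tree && find . -type f -print0 | xargs -0 sha1sum ) | bzip2 > SHA1SUMS.bz2
```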

An optimisation in more recent versions is to read index.gz to avoid having to stat every file to find its inode. This dramatically speeds up determining whether the new generation of a file is hard linked to the previous one.

In cases where the new file is found to be hard linked to the previous generation, there is no point in checksumming it again as the data will be the same. The only situation where the data may have changed is if there has been some corruption, and that should be found when you do periodic checks on your backups anyway.
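The hard-link detection boils down to comparing inode numbers, since hard links to the same data share one inode. A small stand-alone illustration (temporary files only, not part of the script itself):

```shell
# Hard links share an inode; a changed file written by rsync gets a new one.
tmp=$(mktemp -d)
echo "backup data" > "$tmp/gen1"
ln "$tmp/gen1" "$tmp/gen2"    # unchanged file: rsync hard links it
cp "$tmp/gen1" "$tmp/gen3"    # changed file: a fresh copy, new inode
[ "$(stat -c %i "$tmp/gen1")" = "$(stat -c %i "$tmp/gen2")" ] \
    && echo "gen2 unchanged: reuse previous checksum"
[ "$(stat -c %i "$tmp/gen1")" != "$(stat -c %i "$tmp/gen3")" ] \
    && echo "gen3 differs: checksum it afresh"
rm -rf "$tmp"
```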

In most cases you can run dirvish-checksum with no arguments and it will pick up your config and generate the checksums. The first run will be slow as it has to checksum every file in a vault, but subsequent generations will be far more efficient as it will only be dealing with changes.
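For example, a cron entry could run it after the nightly dirvish run has completed (the time and install path here are hypothetical; adjust to your own schedule):

```shell
# /etc/cron.d/dirvish-checksum (example schedule - run after dirvish finishes)
30 5 * * * root /usr/local/sbin/dirvish-checksum
```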

Validating checksums

This uses the standard -c option with md5sum or sha1sum. The syntax is identical for both:

.../tree# bzcat ../MD5SUMS.bz2 | md5sum -c

It is a good idea to validate the checksums periodically: with hard linking, if a file's data becomes corrupted then multiple (perhaps all) generations of that file will also be corrupted, as they share the same data.
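A sketch of checking every generation in a vault in one go (the bank and vault paths are examples, not anything the script mandates):

```shell
# Walk every image in a vault and verify its MD5 checksums.
# /backup/bank/vault is a hypothetical path - substitute your own bank/vault.
for image in /backup/bank/vault/*/; do
    [ -f "$image/MD5SUMS.bz2" ] || continue
    echo "Checking $image"
    ( cd "$image/tree" && bzcat ../MD5SUMS.bz2 | md5sum -c --quiet ) \
        || echo "WARNING: corruption in $image"
done
```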

Download / Install

Download: dirvish-checksum Perl script on GitHub

This will need to be made executable (chmod +x) and placed somewhere suitable, typically /usr/local/sbin.
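Assuming the script has been downloaded into the current directory, installation is just (run as root or via sudo as appropriate):

```shell
# Make the downloaded script executable and put it on root's path.
chmod +x dirvish-checksum
sudo cp dirvish-checksum /usr/local/sbin/
```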

I run this as part of my backup script, which also sets up ssh keys, checks the availability of devices (eg. laptops, phones) and ensures that many causes of failure are avoided.

The script requires Perl, plus the md5sum, sha1sum, bzip2 and find commands. Otherwise it doesn't depend on anything that isn't normally available.

Comments:

29/09/2012 12:51 :: Dan

Hi Glen,

Feedback:

- please continue this work! Maybe you could talk to J.W. Schultz to include your checksum feature into the dirvish project?
- recognising my vaults was difficult because I used "image-default: %Y%m%d%H%M%S" (I changed "\d{8}" to "\d{14}")
- the debug option prints no output

I understand that your script generates MD5SUMS.bz2, SHA1SUMS.bz2 files (3) below the new vault folder after I have made a new image file backup with, for example, "/usr/bin/dirvish --vault client1-1MB_Storage_pics" (1) and after I have used your script (2).

How do I recognise that a previous file has changed according to MD5, and that newly added files are ignored or are going to be OK? Is the validation result (files are OK, files are not OK) supposed to be included in your script?

Thanks! Dan

29/09/2012 13:14 :: Dan

More questions: I think you want to save time by using index.gz (= files which could be hard linked) to skip checksum calculations. But how do you recognise corruption if a file comes from the previous generation?

29/09/2012 15:44 :: Dan

At the moment I will stick to the following:

find tree -type f -print0 | xargs -0 md5sum >> tree_old.md5
find tree -type f -print0 | xargs -0 md5sum >> tree_new.md5
diff 20120929133020/tree_old.md5 20120929151851/tree_new.md5
< d5232e1cbe011120787690cc800226da  tree/airplain.tif
---
> 7a3571bd2732854929019efe7732ade8  tree/airplain.tif

29/09/2012 16:16 :: Dan

Finally I will concentrate on this (time is not a problem for me; I need to understand the functionality):

find tree -type f -print0 | xargs -0 md5sum >> MD5SUMS && bzip2 MD5SUMS
bzcat MD5SUMS.bz2 | md5sum -c 1> /dev/null {compare within the same generation; only show files that differ}
diff <(bzcat 20120929133020/MD5SUMS.bz2) <(bzcat 20120929151851/MD5SUMS.bz2) {compare two generations}

29/09/2012 16:53 :: Glen Pitt-Pladdy

Hi Dan

Not sure if you are using this the way I intended. My aim was to efficiently produce a checksum of each backup so that I could verify the integrity of that backup at a later date. This has proved invaluable with USB disks going flaky and corrupting backups.

You can do a validation of the backup by picking up the checksum files and checking them against the snapshot you made the backup from.

Hope that clarifies how it's intended to be used. If you just want to know which files changed, then sort and diff the indexes: new files won't be hard linked, so the inode will be different.
