Glen Pitt-Pladdy :: Blog

Recovering failing disks
The story goes that it's not a matter of if a disk will fail, but when. Eventually all disks wear out, though exactly when is fairly unpredictable. In my case the main drive in my workstation has started to show signs of trouble - it's done about 4 years of very hard work, so no real surprise it's starting to show wear. Something was clearly up when I started my workstation this morning, and it needed some investigation.

Backups... or should that be restores?

I can't stress the importance of backups enough. At some point a failure will occur - that's certain. The thing to think about is what impact the failure would have. It never ceases to amaze me how many people run without backups, or at best take the integrity of their backups for granted. Backups are one thing, but the key is the ability to restore, and it's often assumed that ability exists until an emergency when something prevents the restore from working. Make sure that you can demonstrate working restores, not just backups.

SMART

My experience with SMART varies - sometimes it is effective at giving an early warning, but often not. Generally the first signs of trouble are kernel logs which show something like:
Jan 18 11:34:37 machine kernel: [  146.230338] ata3.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x0

At that point it's time to take notice. How effective SMART is seems to depend an awful lot on the vendor. In my experience some makes are prone to reporting that the disk is completely healthy when it is clearly quite dead. Fortunately I'm using a Seagate here, and my experience with SMART on Seagate drives is that it's generally more dependable than average. Running a full SMART test is often useful:

# smartctl -t long /dev/sda

... and then once it's finished, take a look at what happened:

# smartctl -a /dev/sda

In this case the test bailed part way through with an error logged:
SMART Self-test log structure revision number 1

That also usefully gives us the position of the error, which we can verify with dd:

# dd if=/dev/sda of=/dev/null bs=512 skip=277894547

... with I/O errors returned and logged. Subsequent to running the test, SMART has also recognised it's got a bad sector:
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 1

Interestingly, it still thinks that the drive is healthy and hasn't started showing problems with any of the parameters. It's arguable that it's right - this may just be a one-off bad sector and the disk may otherwise be healthy with many years left. In many cases bad sectors will be quietly reallocated without the user even knowing it's happened. In this case I got a bit unlucky and it hasn't reallocated on its own.

Tracking down the failure

Just knowing that we have a problem disk is not enough - we need to know what is affected and be able to make sound decisions based on solid evidence. There is a howto on dealing with bad blocks from the smartmontools people which explains much of the process. In my case I am using LVM on a regular partition, so it went like this:

Find the partition involved - it's a bit obvious, but if it's not then work it out from this:
# sfdisk -luS /dev/sda

Clearly it's in the main LVM partition - in fact around the middle. Figure out which LVs are affected - a bit more tricky. First we need to find where the PEs start on the PV:
# pvs -o+pe_start /dev/sda2

Also get the PE size of the volume:
# pvdisplay -c /dev/sda2 | awk -F: '{print $8}'

Now we can work out the PE position: take the bad sector's offset into the PV's data area (remembering that sectors are normally 512 bytes, i.e. 0.5KiB) and divide by the PE size.
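That arithmetic can be sketched in shell. The bad sector LBA and PE size are taken from this article; the partition start and pe_start below are hypothetical placeholders (so the result here differs from my actual PE) - substitute the values from the commands above:

```shell
# Hypothetical example values - substitute your own:
BAD_SECTOR=277894547   # failing LBA reported by the SMART self-test
PART_START=63          # start sector of /dev/sda2 from sfdisk -luS (hypothetical)
PE_START=384           # pe_start converted to sectors, from pvs -o+pe_start (hypothetical)
PE_SIZE_KB=4096        # PE size in KiB from pvdisplay -c

# A sector is 512 bytes, so one PE covers PE_SIZE_KB * 2 sectors.
PE=$(( (BAD_SECTOR - PART_START - PE_START) / (PE_SIZE_KB * 2) ))
echo "bad sector is in PE $PE"
```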
So really we are looking for PE 33801. Then, to find what is affected by the failure:
# lvdisplay --maps |egrep 'Physical|LV Name|Type'

And there we have it - it's the /home volume, and not surprisingly it's also the one that gets the most thrashing. Next we find what files (if any) are affected - in my case I'm running JFS, so finding the file is a bit more tricky, but fortunately there is a way to find the files in a bad block on JFS. A shortcut for smaller filesystems is just to brute-force read everything and see what gives errors, which may be quicker than working it all out for small volumes of data:

# find /home/ -mount -type f -print -exec md5sum {} \; 2>&1 | grep 'Input/output error'

That tries to checksum everything, and we filter for any I/O errors. To track it down manually: we know that we are 33801 * 4096 KiBytes into the filesystem, or 33801.041 * 1024 JFS blocks (4KiB) into the filesystem, which is block number 34612265, so we can fire up jfs_debugfs:
# jfs_debugfs /dev/VG00/home

That will give you a listing for that block - look for "di_number", which is the inode number. You will probably want to do a few blocks onwards too and note down the inode numbers. Then we find matching files:

# find /home/ -mount -inum 131368

In my case a file in my Firefox cache was affected, which is a bit of luck since it's easy to do without the file. I stopped Firefox and set to work on the file.

Fixing - forcing a reallocation

Disks should try to reallocate bad sectors to spare space on the disk, though it can take a bit of work to give them enough of a kick for this to happen. The first thing is to try writing to the bad area. In my case I would prefer to just write to the affected file rather than write data over that part of the disk directly - if I get it wrong with the disk then the damage can be far greater. If the bad area is not in an allocated file then one trick is to write a file that fills up all the free space, but it's probably best not to be running anything that needs to write to the volume while you do this:

# dd if=/dev/zero of=zerofile

A trick for isolating a file is to use the loop device:

# losetup /dev/loop0 .mozilla/firefox/y4sewdmn.default/Cache/996230C0d01

That makes /dev/loop0 map onto the file, and writing to /dev/loop0 is contained within the file:

# dd if=/dev/zero of=/dev/loop0

In my case dd was happy, but the kernel still logged loads of errors including "auto reallocate failed", and md5summing the affected file still gave I/O errors. One trick I have learned is that the drive firmware sometimes needs a reboot to get it to take action, so I powered down the machine and tried again afterwards. This time success: the file was written and checksummed with no errors. To disconnect the loop device:

# losetup -d /dev/loop0

At this point I simply removed the file as it's expendable, but you may want to restore the file from backup if it's important.
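As a cross-check on the block-number arithmetic used earlier (PE 33801, 4MiB PEs, 4KiB JFS blocks), the conversion can be written out in shell. The offset of the bad sector within the PE is back-derived here from the fractional 0.041 PE quoted above, so treat it as approximate:

```shell
PE=33801          # PE containing the bad sector (from lvdisplay --maps)
PE_SIZE_KB=4096   # 4MiB physical extents
FS_BLOCK_KB=4     # JFS block size
OFFSET_KB=164     # offset of the bad sector within the PE, in KiB (approximate)

# Filesystem blocks before this PE, plus whole blocks into the PE:
BLOCK=$(( PE * PE_SIZE_KB / FS_BLOCK_KB + OFFSET_KB / FS_BLOCK_KB ))
echo "JFS block $BLOCK"
```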
Second try

After the initial recovery I ran another SMART test for safety, and it showed more errors:
SMART Self-test log structure revision number 1

This suggests that the drive has a more general problem with wear and that another area of the disk is now also showing problems. The same checks and repair work were done again, but it would probably be wise to plan to replace the disk soon.
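Given that new bad areas are appearing, it's worth keeping an eye on the reallocation counters until the replacement disk arrives. A minimal sketch - shown here parsing a captured attribute table so it runs without a disk; in practice pipe in the output of `smartctl -A /dev/sda`. The Current_Pending_Sector line is from this article, while the Reallocated_Sector_Ct line is hypothetical:

```shell
# Captured smartctl -A attribute lines (Reallocated line is hypothetical):
smart_output='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1'

# The raw value is the last field of each attribute line.
pending=$(echo "$smart_output" | awk '/Current_Pending_Sector/ {print $NF}')
realloc=$(echo "$smart_output" | awk '/Reallocated_Sector_Ct/ {print $NF}')
echo "pending=$pending reallocated=$realloc"
```

If the pending count keeps climbing, or reallocations start accumulating, stop trusting the drive.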
Copyright Glen Pitt-Pladdy 2008-2023