Recovering failing disks

The story goes that it's not a matter of if a disk will fail, but when. Eventually all disks wear out, though exactly when is fairly unpredictable.

In my case the main drive in my workstation has started to show signs of trouble - it's done about 4 years of very hard work so no real surprise it's starting to show wear.

Something was clearly up when I started up my workstation this morning and it needed some investigation.

Backups... or should that be restores

I can't stress the importance of backups enough. At some point a failure will occur - that's certain. The thing to think about is what impact the failure would have.

It never ceases to amaze me how many people run without backups, or at best take the integrity of their backups for granted. Backups are one thing, but the key is the ability to restore - something that's often just assumed until an emergency arrives and something prevents the restore from working.

Make sure that you can demonstrate working restores, not just backups.
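
For example, one simple way to prove a restore works is to pull a sample out of the most recent backup into a scratch directory and compare it with the live copy - a rough sketch, assuming a tar-based backup (the archive name and paths here are just placeholders):

# mkdir /tmp/restore-test
# tar -xzf /backup/home-latest.tar.gz -C /tmp/restore-test home/glen
# diff -r /tmp/restore-test/home/glen /home/glen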

SMART

My experience with SMART varies - sometimes it is effective at giving an early warning, but often not. Generally the first sign of trouble is the kernel log, which shows something like:

Jan 18 11:34:37 machine kernel: [  146.230338] ata3.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x0
Jan 18 11:34:37 machine kernel: [  146.230342] ata3.00: irq_stat 0x40000008
Jan 18 11:34:37 machine kernel: [  146.230346] ata3.00: failed command: READ FPDMA QUEUED
Jan 18 11:34:37 machine kernel: [  146.230352] ata3.00: cmd 60/38:08:5e:55:90/00:00:10:00:00/40 tag 1 ncq 28672 in
Jan 18 11:34:37 machine kernel: [  146.230353]          res 41/40:38:94:55:90/00:00:10:00:00/40 Emask 0x409 (media error) <F>
Jan 18 11:34:37 machine kernel: [  146.230356] ata3.00: status: { DRDY ERR }
Jan 18 11:34:37 machine kernel: [  146.230358] ata3.00: error: { UNC }
...

At that point it's time to take notice. How effective SMART is seems to depend an awful lot on the vendor. In my experience some makes are prone to reporting that the disk is completely healthy when it is clearly quite dead. Fortunately I'm using a Seagate here and my experience with SMART on Seagate drives is that it's generally more dependable than average. Running a full SMART test is often useful:

# smartctl -t long /dev/sda

... and then once it's finished take a look at what happened:

# smartctl -a /dev/sda

In this case the test bailed part way through with an error logged:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       50%     12628         277894548
# 2  Extended offline    Completed without error       00%     12482         -
....

That's also usefully given us the position of the error which we can verify with dd:

# dd if=/dev/sda of=/dev/null bs=512 skip=277894547

... with I/O errors returned and logged.
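
If reading from there right through to the end of the disk is overkill, just the single reported sector can be read back, using the LBA straight from the self-test log:

# dd if=/dev/sda of=/dev/null bs=512 skip=277894548 count=1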

Subsequent to running the test SMART has also recognised it's got a bad sector:

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1

Interestingly it still thinks that the drive is healthy and hasn't started showing any problems with any of the parameters. It's arguable that it's right - this may just be a one-off bad sector and the disk is otherwise healthy with many years left. In many cases bad sectors will be quietly reallocated without the user even knowing it's happened. In this case I got a bit unlucky and it hasn't reallocated on its own.
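
These are the counters to keep an eye on while working through the problem (Reallocated_Sector_Ct is the other one worth watching once sectors start getting remapped) - a quick way to pull out just the relevant attributes:

# smartctl -A /dev/sda | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'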

Tracking down the failure

Just knowing that we have a problem disk is not enough - we need to know what is affected and be able to make sound decisions based on solid evidence.

There is a howto on dealing with bad blocks from the smartmontools people which explains much of the process. In my case I am using LVM on a regular partition so it went like this:

Find the partition involved - it's a bit obvious but if it's not then work it out from this:

# sfdisk -luS /dev/sda

Disk /dev/sda: 30401 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sda1            63    996029     995967  83  Linux
/dev/sda2        996030 488392064  487396035  8e  Linux LVM
/dev/sda3             0         -          0   0  Empty
/dev/sda4             0         -          0   0  Empty

Clearly the main LVM partition - in fact around the middle.
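
If it were less obvious, the start and end sectors in that output can be checked programmatically - a rough sketch against the listing above (it assumes no boot flag, which would shift the columns):

# sfdisk -luS /dev/sda | awk -v lba=277894548 '$2+0 <= lba && lba <= $3+0 {print $1}'
/dev/sda2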

Figure out what LVs are affected - a bit more tricky. First we need to figure out where the PEs start on the PV:

# pvs -o+pe_start /dev/sda2
  PV         VG   Fmt  Attr PSize   PFree  1st PE
  /dev/sda2  VG00 lvm2 a-   232.41g 46.41g 192.00k

Also get the PE size of the volume:

# pvdisplay -c /dev/sda2 | awk -F: '{print $8}'
4096
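
Both numbers can also be pulled out in one go with pvs (field names as in reasonably recent LVM2 versions):

# pvs -o pe_start,vg_extent_size --units k /dev/sda2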

Now we can work the sector position through to a PE position, remembering that sectors are normally 512 bytes (0.5KiB). The calculation goes like this (a shell version of the same sums follows the list):

  • Sector position on partition = 277894548 - 996030 = 276898518
  • KiByte Offset on partition = 276898518 / 2 = 138449259
  • KiByte Offset from start of PEs = 138449259 - 192 = 138449067
  • PE Position on PV = 138449067 / 4096 = 33801.041...
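
The same sums can be done straight in the shell - here with the numbers for this disk plugged in:

# echo $(( (277894548 - 996030) / 2 ))
138449259
# echo $(( ((277894548 - 996030) / 2 - 192) / 4096 ))
33801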

So really we are looking for PE 33801. Then to find what is affected by the failure:

# lvdisplay --maps |egrep 'Physical|LV Name|Type'
.....
  LV Name                /dev/VG00/home
    Type        linear
    Physical volume    /dev/sda2
    Physical extents    16896 to 47615
.....

And there we have it - it's the /home volume, and not surprisingly it's also the one that gets the most thrashing.

Next we find what files (if any) are affected - in my case I'm running JFS so finding the file is a bit more tricky, but fortunately there is a way to find the files affected by a bad block on JFS.

A short cut for smaller filesystems is just to brute-force read everything and see what throws errors, which may be quicker than working it all out for small volumes of data:

# find /home/ -mount -type f -print -exec md5sum {} \; 2>&1 | grep 'Input/output error'

That tries to checksum everything and we filter it for any I/O Errors.

To track it down manually: we know that we are 33801 * 4096 KiBytes into the filesystem, or 33801.041 * 1024 JFS blocks (4KiB each), which is block number 34612265, so we can fire up jfs_debugfs:

# jfs_debugfs /dev/VG00/home
jfs_debugfs version 1.1.12, 24-Aug-2007

Aggregate Block Size: 4096

> display 34612265 0 i

That will give you a listing for that block - look for "di_number" which is the inode number. You will probably want to do a few blocks onwards too and note down the inode numbers. Then we find matching files:

# find /home/ -mount -inum 131368

In my case a file in my Firefox cache was affected which is a bit of luck since it's easy to do without the file. I stopped Firefox and set to work on the file.

Fixing - forcing a reallocation

Disks should try to reallocate bad sectors to spare space on the disk, though it can take a bit of work to give them enough of a kick for this to happen. The first thing is to try to write to the bad area. In my case I would prefer to just write to the affected file rather than write directly over that part of the disk - if I get the raw disk write wrong the damage could be far greater.

If the bad block is not in an allocated area (no file is affected) then one trick is to write a file that fills up all the free space, but it's probably best not to be running anything that will need to write to the volume while you do this:

# dd if=/dev/zero of=zerofile
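
That will keep going until the volume is full (ending with a "No space left on device" error, which is expected here), then make sure the writes have actually hit the disk before cleaning up:

# sync
# rm zerofile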

A trick for isolating a file is to use the loop device:

# losetup /dev/loop0 .mozilla/firefox/y4sewdmn.default/Cache/996230C0d01

That makes /dev/loop0 map onto the file and writing to /dev/loop0 is contained within the file:

# dd if=/dev/zero of=/dev/loop0
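
If there is any doubt about the data actually reaching the disk rather than sitting in the page cache, GNU dd can be told to flush when it finishes - a minor variation on the same command:

# dd if=/dev/zero of=/dev/loop0 conv=fsync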

In my case dd was happy but the kernel still logged loads of errors including "auto reallocate failed", and md5summing the affected file still gave I/O errors.

One trick I have learned is that the drive firmware sometimes needs a reboot to get it to take action, so I powered down the machine and tried again afterwards. This time success: the file was written and checksummed with no errors. To disconnect the loop device:

# losetup -d /dev/loop0

At this point I simply removed the file as it's benign, but you may want to restore the file from backup if it's important.

Second try

After the initial recovery I ran a SMART test again for safety and it showed more errors:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       50%     12631         309282282
# 2  Extended offline    Completed: read failure       50%     12628         277894548
# 3  Extended offline    Completed without error       00%     12482         -

This suggests that the drive has a more general problem with wear, with another area of the disk now also showing errors.

The same checks and repair work were done for this area, but it would probably be wise to plan on replacing the disk soon.
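
Until the replacement happens it's worth having smartd keep watch and run self-tests on a schedule - a minimal sketch of an /etc/smartd.conf line (the schedule is just an example: short tests nightly at 02:00, a long test on Saturdays at 03:00, mail to root on trouble):

/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

Restart the smartd (smartmontools) service afterwards so the new entry is picked up.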
