Glen Pitt-Pladdy :: Blog

Rapid drive failure (as seen by Cacti)

I've got extensive monitoring of systems and with the Cacti templates I have that includes SMART and iostat.
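The SMART side of that monitoring boils down to collecting the attribute table that `smartctl -A` prints for each drive. As a rough sketch of what a collector has to do (the sample output fragment and the function name are illustrative, not taken from the failing drive), something like this parses that table:

```python
import re

def parse_smart_attributes(smartctl_output):
    """Parse the attribute table from `smartctl -A` output into a dict
    mapping attribute name -> normalised value, worst, threshold, raw count."""
    attrs = {}
    for line in smartctl_output.splitlines():
        # Attribute rows start with the numeric attribute ID
        m = re.match(
            r'\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)'
            r'\s+\S+\s+\S+\s+\S+\s+(\d+)',
            line)
        if m:
            _id, name, value, worst, thresh, raw = m.groups()
            attrs[name] = {'value': int(value), 'worst': int(worst),
                           'thresh': int(thresh), 'raw': int(raw)}
    return attrs

# Illustrative fragment of `smartctl -A` output
SAMPLE = """
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       214
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       12
"""

attrs = parse_smart_attributes(SAMPLE)
print(attrs['Reallocated_Sector_Ct']['raw'])  # raw count of reallocated sectors: 214
```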

Last night I had an unusually rapid deterioration in drive health. The drive is part of a RAID5 volume, and surprisingly it wasn't the one I expected to fail: I had been watching another drive in the array gradually deteriorate over the past year.

Previously I've seen parameters deteriorate over months. What was unexpected this time was how rapidly the drive went downhill - in just over an hour it dropped from 100% health to warning of impending failure.

The SMART point of view

This graph using my Cacti SMART templates says it all:

[Image: Cacti SMART unhealthy drive]

While the drive has deteriorated rapidly, it hasn't yet been failed out of the array - I expect that to happen as soon as it runs out of spare sectors to reallocate to.
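Given the speed of the deterioration, a crude linear extrapolation of the normalised SMART value against the vendor's failure threshold gives a rough ETA for when SMART will flag the attribute as failed. This is only a sketch with made-up sample numbers (the function name is mine, and in practice the decay is rarely linear):

```python
def estimate_failure_eta(samples, thresh):
    """Naively extrapolate when the normalised SMART value will cross the
    vendor failure threshold, from (timestamp_seconds, value) samples.
    Returns the estimated timestamp, or None if the value isn't falling."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    rate = (v1 - v0) / (t1 - t0)   # points per second; negative when deteriorating
    if rate >= 0:
        return None                 # not deteriorating, no ETA
    return t1 + (thresh - v1) / rate

# Illustrative: value drops 100 -> 60 in an hour, vendor threshold 36
print(estimate_failure_eta([(0, 100), (3600, 60)], 36))  # 5760.0 (seconds)
```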

The iostat point of view

What is curious though is comparing the iostat graphs to those of the other drives in the array. Theoretically IO should be evenly spread across all drives in the array, but the unhealthy drive is distinctly different in several areas.
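That even-spread expectation suggests a simple automated check: flag any array member whose iostat metric stands well apart from its peers. A minimal sketch, with an illustrative per-drive %util snapshot and an arbitrary factor-of-the-median rule (both are my assumptions, not part of my Cacti setup):

```python
import statistics

def flag_outliers(per_drive_util, factor=1.5):
    """Flag drives whose %util is more than `factor` times the median of
    their RAID peers - a crude way to spot the odd one out in an array."""
    med = statistics.median(per_drive_util.values())
    return [drive for drive, util in per_drive_util.items()
            if util > factor * med]

# Illustrative figures, not the real graph data
print(flag_outliers({'sda': 12.0, 'sdb': 11.5, 'sdc': 48.0, 'sdd': 12.3}))
# ['sdc']
```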

Unhealthy drive %Utilisation:

[Image: Cacti iostat %Util unhealthy drive]

Healthy drive %Utilisation:

[Image: Cacti iostat %Util healthy drive]

During the period of deterioration this distinctly increased, but it's really just an indication of how much of the drive's bandwidth is being used - if the drive is busy reallocating sectors, it's no surprise that it has less bandwidth left for normal IO.
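For reference, %util is derived from the "time spent doing I/Os" counter (io_ticks, in milliseconds) that the kernel exposes in /proc/diskstats: iostat takes the delta over the sampling interval as the fraction of time the device had IO in flight. A minimal sketch (the function name is mine):

```python
def percent_util(io_ticks_prev, io_ticks_now, interval_ms):
    """%util as iostat derives it from /proc/diskstats: the fraction of the
    sampling interval during which the device had I/O in flight."""
    return 100.0 * (io_ticks_now - io_ticks_prev) / interval_ms

# 4.2 s of device busy time over a 10 s sampling interval
print(percent_util(0, 4200, 10000))  # 42.0
```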

Unhealthy drive await:

[Image: Cacti iostat R/W await unhealthy drive]

Healthy drive await:

[Image: Cacti iostat R/W await healthy drive]

The same thing is happening here - IO is taking distinctly longer during the deterioration, while sectors are being reallocated.
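await is another per-interval derivation from the /proc/diskstats counters: the total milliseconds requests spent queued and being serviced, divided by the number of completed requests. A sketch with illustrative numbers (the function name is mine):

```python
def average_await(rd_ticks, wr_ticks, reads, writes):
    """await as iostat computes it: total time (ms) requests spent queued
    and in service, divided by the number of completed requests."""
    ios = reads + writes
    return (rd_ticks + wr_ticks) / ios if ios else 0.0

# 1200 ms of cumulative request time across 150 completed requests
print(average_await(rd_ticks=900, wr_ticks=300, reads=100, writes=50))  # 8.0 ms
```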

Unhealthy drive svctm:

[Image: Cacti iostat svctm unhealthy drive]

Healthy drive svctm:

[Image: Cacti iostat svctm healthy drive]

This is very obvious - when we eliminate the amount of IO from the equation (yes, I know svctm can't really be trusted for this when concurrent IO is happening) the unhealthy drive stands apart most clearly.


