Glen Pitt-Pladdy :: Blog

Rapid drive failure (as seen by Cacti)

I've got extensive monitoring on my systems, and with my Cacti templates that includes SMART and iostat.

Last night I had an unusually rapid deterioration in drive health. The drive is part of a RAID5 volume, but surprisingly I had been expecting a different drive to fail first, having watched that one gradually deteriorate over the past year.

Previously I've seen parameters deteriorate over months. The unexpected thing this time was the speed: in just over an hour the drive went from 100% health to warning of impending failure.

The SMART point of view

This graph using my Cacti SMART templates says it all:

Cacti SMART unhealthy drive

While it appears to have deteriorated rapidly, it hasn't been failed from the array yet - I expect that to happen as soon as it runs out of spare sectors to reallocate to.
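As a rough sketch of the sort of thing the Cacti SMART templates poll under the hood, here's how the smartmontools attribute table can be parsed to watch reallocation-related attributes. The sample output and the "close to threshold" margin are made up for illustration; in practice the text would come from running `smartctl -A` against the drive:

```python
import re

# Sample of `smartctl -A /dev/sda` output (smartmontools attribute table format);
# in practice you would read this via subprocess.check_output(["smartctl", "-A", dev]).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       214
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       12
198 Offline_Uncorrectable   0x0010   098   098   000    Old_age   Always       -       3
"""

def parse_attributes(text):
    """Parse the attribute table into {name: (value, thresh, raw)}."""
    attrs = {}
    for line in text.splitlines():
        m = re.match(r'\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)'
                     r'\s+(\d+)\s+\S+\s+\S+\s+\S+\s+(\d+)', line)
        if m:
            _id, name, value, _worst, thresh, raw = m.groups()
            attrs[name] = (int(value), int(thresh), int(raw))
    return attrs

attrs = parse_attributes(SAMPLE)
for name in ("Reallocated_Sector_Ct", "Current_Pending_Sector"):
    value, thresh, raw = attrs[name]
    # Arbitrary illustrative margin: flag when the normalised value nears the threshold
    flag = "  <-- approaching threshold!" if value - thresh < 60 else ""
    print(f"{name}: value={value} thresh={thresh} raw={raw}{flag}")
```

The normalised VALUE falling towards THRESH is what turns into the "% health" style graphs; the RAW_VALUE (actual reallocated sector count) is often the more useful trend to watch.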

The iostat point of view

What is curious though is comparing the iostat graphs with those of other drives in the array. Theoretically IO should be evenly spread across all drives in the array, but the unhealthy drive is distinctly different in many areas.

Unhealthy drive %Utilisation:

Cacti iostat %Util unhealthy drive

Healthy drive %Utilisation:

Cacti iostat %Util healthy drive

During the period of deterioration this increased distinctly, but %util is really just an indication of how much of the drive's bandwidth is being used - if the drive is busy reallocating sectors, it's no surprise that less bandwidth is left for normal IO.
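Spotting the odd drive out in an array can be done directly from `iostat -x` extended output as well as from the graphs. A minimal sketch, assuming sysstat's column layout; the device figures and the outlier threshold here are invented for illustration:

```python
# Compare %util across RAID member drives from `iostat -x` extended output.
# In practice this text would come from running iostat, not an embedded sample;
# the numbers below are made up (sdc mimicking a drive busy reallocating sectors).
SAMPLE = """\
Device:  rrqm/s wrqm/s   r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.10   1.20  5.00  3.00  80.00  40.00    15.00     0.05    4.10   1.20    1.0
sdb        0.10   1.10  5.10  3.00  81.00  40.00    15.10     0.04    4.00   1.10    0.9
sdc        0.09   1.20  4.90  2.90  79.00  39.00    15.00     0.60  120.00  40.00   31.2
"""

def util_by_device(text):
    """Extract %util (last column) for each sdX device line."""
    utils = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("sd"):
            utils[fields[0]] = float(fields[-1])
    return utils

utils = util_by_device(SAMPLE)
mean = sum(utils.values()) / len(utils)
for dev, u in sorted(utils.items()):
    # Arbitrary rule of thumb: flag anything well above the array average
    note = "  <-- outlier, possibly unhealthy" if u > 1.5 * mean else ""
    print(f"{dev}: {u:5.1f}%{note}")
```

Since RAID5 spreads IO roughly evenly across members, a member sitting far above the others on %util for the same workload is exactly the kind of divergence the graphs show.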

Unhealthy drive await:

Cacti iostat R/W await unhealthy drive

Healthy drive await:

Cacti iostat R/W await healthy drive

The same thing is happening here - IO takes distinctly longer during the deterioration, while sectors are being reallocated.

Unhealthy drive svctm:

Cacti iostat svctm unhealthy drive

Healthy drive svctm:

Cacti iostat svctm healthy drive

This is the most obvious one - when we take the amount of IO out of the equation (yes, I know svctm can't really be trusted for that when concurrent IO is happening), the unhealthy drive stands apart most clearly.
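To see why svctm separates the drives so clearly: iostat derives it roughly as busy time divided by completed IOs, so it approximates the time the drive itself spends servicing each request, whereas await also includes time spent queued. A small sketch of that relationship, with made-up figures:

```python
def svctm_ms(util_pct, iops):
    """Approximate per-IO service time in ms: busy fraction / IO rate.

    This mirrors how iostat roughly derives svctm - it factors the volume
    of IO out, leaving something closer to the drive's own responsiveness.
    """
    return (util_pct / 100.0) / iops * 1000.0

# Invented illustrative figures over a sampling interval:
healthy = svctm_ms(util_pct=1.0, iops=8.0)    # low utilisation at modest IO rate
failing = svctm_ms(util_pct=31.2, iops=7.8)   # same IO rate, drive busy far longer per IO

print(f"healthy svctm ~ {healthy:.2f} ms, failing svctm ~ {failing:.2f} ms")
```

With the IO rate held constant, the drive that is internally busy reallocating sectors spends far longer per request, which is why svctm diverges even when the workload across the array is identical.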

 
