Testing Drives

With the 2011 Thailand Floods and the subsequent hike in hard drive prices (many drives or parts were produced in Thailand) I started buying used drives on eBay for a fraction of the price of new drives at the time. I have nearly 20 drives in use between my arrays and backups, and with drives having a limited working life I typically need to replace a drive every few months. At the time of writing I have just experienced a near-failure (I had a spare ready), and monitoring is indicating another two drives which are "at risk" - that could mean they fail in the next day or the next year... perhaps they even get replaced in an upgrade cycle before they fail.

With the market settling again, many used drive prices are now far less attractive (in fact many people seem to be trying to sell drives for more than new prices), and in my experience many more unhealthy used drives (up to about a third) are being sold lately, so I now buy a larger proportion of new drives again. Nonetheless, I decided to write about the drive testing approach I've developed with this experience.

There is at least a higher perceived risk of failure with used drives, though new drives are far from failure-free. There is also, in theory, a far greater risk that a set of new drives (eg. for RAID) has all come from the same batch, shipment or pallet and suffered the same rough handling, making an array prone to multiple drive failures - especially with the long rebuild times of large drives and the high stress placed on drives during a rebuild.

In my experience this testing approach is less effective for new drives, though having seen many "infant deaths" with new drives I still follow this process as a precaution before using any drive, new or used.

The approach here uses Linux tools, though there are ports or equivalent tools for other operating systems.

A note on "easy" drive test software

My experience is that these are often not to be trusted, especially tools from drive vendors. After pressing vendors, I have in the past had them accept drives back after large-scale failures of a model at a company I worked for, despite their test tools claiming the drives to be healthy.

I suspect that reducing support calls is more of a priority with vendor tools than comprehensively identifying risks with drive health.

Other tools seem to vary in quality a lot and I'm not convinced how much experience with real world drive testing the authors of some of these tools have. I leave you to decide how much you trust them.

Unless the tool is well documented, explains what it does and why, and all of that stacks up, I would avoid it. Personally I would always avoid tools which make claims of comprehensive testing but won't go into details. This is no different from other scams where grandiose claims are made with no evidence to back them up.

Good tools should give you detailed technical evidence and an explanation of how they work and why they should be trusted.

Warning

Methods discussed here carry risk of data loss and even bricking healthy drives, especially if not done properly. Make sure you fully understand the whole process and risks before doing this.

This overall process will destroy data on the drive being tested.

Initial health

SMART monitoring is available on all modern drives I've come across, and a good many other devices (eg. Compact Flash). While it's a long way from perfect, as well as being variable between manufacturers, it nonetheless gives insight into the health of many drive internals which would otherwise not be available.

SMART parameters need careful interpretation, especially since different drive manufacturers represent them in different ways. Small amounts of degradation in some parameters may indicate an unhealthy drive, where others are massively variable and mean little unless they veer strongly towards the failure threshold. Some manufacturers also seem prone to drives becoming unusable well before their SMART parameters start to indicate the slightest problem.

The first thing I do upon receiving a drive is to capture the SMART info to a file:

# smartctl -a /dev/sdX >/path/to/drive-asreceived.smart

Where "/dev/sdX" is the device for the drive you are testing.

At this point the parameters aren't that valuable though there are a few things to look out for:

  • Overall health and any parameters which have failed
  • SMART Error Logs
  • Any tests logged, especially any that have not completed and have a LBA_of_first_error
  • Parameters that have degraded significantly - though that really needs a reference from a known healthy drive to know what values are typical starting values for that model of drive

Particular parameters worth noting:

  • Power_On_Hours - RAW_VALUE gives an idea of drive age/usage
  • Spin_Retry_Count - generally indicates mechanical problems
  • Reallocated_Event_Count - generally indicates media problems
  • Reported_Uncorrect - generally indicates media problems
  • High_Fly_Writes - generally indicates mechanical problems
  • Current_Pending_Sector - generally indicates media problems
  • Offline_Uncorrectable - generally indicates media problems
  • Load_Cycle_Count - wear indicator that has become a major problem on power-efficient drives (especially "Green" and laptop drives) in recent years. Check the RAW_VALUE against the drive spec to get an idea of wear.
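
A quick way to pull these out of the captured file for a first look is something like this (attribute names vary a little between manufacturers, so adjust the list to what your drive reports):

# grep -E 'Power_On_Hours|Spin_Retry_Count|Reallocated|Reported_Uncorrect|High_Fly_Writes|Current_Pending_Sector|Offline_Uncorrectable|Load_Cycle_Count' /path/to/drive-asreceived.smart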

Another thing to note here is that any errors logged also have a time (based on Power_On_Hours) at which the error occurred. This is useful when, inevitably, you determine a drive is unhealthy and want to return it, and the seller claims that it was healthy when they shipped it... then you can check how long the drive had been running when the errors were logged.

It's surprising how often sellers immediately claim it's not their problem since the drive must have been damaged during shipping, and you have to point out that these parameters only degrade while the drive is in use and not while it's being shipped, plus that there are errors logged in the weeks leading up to the sale.
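
If you need that evidence, the error log alone (with the Power_On_Hours count at which each error was recorded) can be pulled up with:

# smartctl -l error /dev/sdX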

(Re-)burn-in

At this point we do some general testing and make the drive do some work - both read and write. I normally kick off a "long" SMART test which will do general mechanical and media (read) tests, however it doesn't test the interface as this all happens within the drive itself:

# smartctl -t long /dev/sdX

On large drives that can take several hours, so poll the drive periodically to make sure it doesn't get put to sleep and the test terminated.

You can check that the test has completed by re-running the SMART status command from earlier (ie. without sending the output to a file).
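
One simple way to keep the drive active and watch progress at the same time is to poll it every few minutes, for example (assuming "watch" is available):

# watch -n 300 "smartctl -a /dev/sdX | grep -A1 'Self-test execution status'"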

Next, a plain ordinary read test gets data going through the interface and gives the drive some more exercise:

# badblocks -v -b 1024 /dev/sdX

The one thing I do is monitor /var/log/kern.log (Debian - it all goes into /var/log/messages on other distros) for errors being logged while accessing the drive.
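
For example, in another terminal while the tests run:

# tail -f /var/log/kern.log | grep -i 'sdX'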

Read tests rarely show anything useful as modern drives reallocate sectors and have other mechanisms to recover from (conceal) problems. If a drive shows the slightest problem with read tests then it's likely in a very bad state where none of the integrity mechanisms are able to recover the data.

Last, but very important, is a write test. As a large proportion of used drives have not been properly erased (as some high-profile Data Protection cases have shown), this is a good opportunity to use the built-in erase available in most drives.

This is risky and you should read the hdparm docs carefully to establish that the setup and version you are using to test the drive is known to be good (USB bridges are generally a bad idea). Also, problems can occur if the system crashes, is disconnected or power is lost during the erase. Some versions of hdparm and the kernel also have problems under some circumstances. This can render otherwise healthy drives unusable... you have been warned!

Also, in case you haven't yet realized, you should not do any of this to a drive which contains data you would like to keep.

First check the drive:

# hdparm -I /dev/sdX

Note whether it has support for "security erase" or "enhanced security erase" - it's rare that drives don't have this. Also ensure the drive is "not locked" and "not frozen". Note how long the erase will likely take.
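
The relevant details are in the "Security" section of the output, so something like this shows just that part (exact wording varies between drives and hdparm versions):

# hdparm -I /dev/sdX | grep -A10 'Security:'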

Next we need to enable the drive security by setting a password before we can trigger the erase:

# hdparm --user-master u --security-set-pass somepassword /dev/sdX

Then we can run the erase. At this point you need to let things run until the command completes, else the drive could be bricked:

# hdparm --user-master u --security-erase-enhanced somepassword /dev/sdX

If your drive doesn't support "enhanced security erase" then simply use the "--security-erase" option instead.
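
Once the erase completes the drive should clear the security password itself; re-checking the "Security" section of "hdparm -I" to confirm it is back to "not enabled" is a sensible sanity check before putting the drive to use:

# hdparm -I /dev/sdX | grep -A10 'Security:'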

Another possibility (or addition) which also exercises the interface is to use "dd". The firmware erase is potentially better as it theoretically might also erase reserved space on the drive that is not otherwise accessible, but that's mostly optimistic speculation. The "dd" approach would be something like:

# dd if=/dev/zero of=/dev/sdX bs=4k

Again, monitoring the logs is a good idea here as they may reveal problems.

Another possible approach would be write testing with "badblocks".
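
If you prefer "badblocks" for the write stage, its destructive write mode writes patterns and reads them back for verification - again, this destroys all data on the drive:

# badblocks -wsv /dev/sdX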

Depending on the level of certainty you want you can do other tests, run other test tools and generally make the drive work hard.

Final Health

Now that the drive has had a run with both read and write tests, it's time to see how it's held up. Capture the SMART parameters again:

# smartctl -a /dev/sdX >/path/to/drive-aftertest.smart

This can be done between each stage of testing above for more detail on what each particular test did to the parameters.

Now you have a picture of how healthy the drive thinks it is after it's done some work. Load the two files created into your favourite "diff" tool and examine what changed and by how much.
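
A plain "diff" of the two captures is usually enough to spot which attributes have moved:

# diff /path/to/drive-asreceived.smart /path/to/drive-aftertest.smart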

Pay particular attention to the parameters listed earlier. Very often parameters like High_Fly_Writes might tick over a few points during the write tests, which indicates that with use the drive may be degrading rapidly towards failure.

Depending on how marked the changes are you may like to repeat some or all of the testing and check health a second time. There is a level of experience involved in evaluating this, and having other same-model drives to compare against is useful. Researching particular parameters and "normal" values is often a good idea.

Other indicators

As I have a number of arrays which should have a good balance of work between drives, one other thing that I have observed is that drives in poor health often exhibit worse performance than healthy drives prior to failure. I suspect this is due to the drive firmware silently retrying operations and/or additional error correction being needed, perhaps even in some cases mechanical wear reducing seek speed.

Here are the key columns of a simple "iostat -x" showing 3 drives in a RAID5 array, the last of which is showing declining health in its SMART parameters:

Device:     await r_await w_await  svctm  %util
sda          5.78    5.69    5.87   3.37   0.35
sdb          6.12    6.16    6.09   3.47   0.36
sdc          7.69    7.12    8.20   4.72   0.49

As you can see, the last (unhealthy) drive is distinctly slower than the first two (healthy) drives.
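
If you want to collect similar figures, averaging over a longer interval smooths out short bursts of activity - for example five 60-second samples of the extended device report:

# iostat -dx 60 5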

What's it worth?

There are no guarantees with drive testing, other than that at some point the drive will fail. My experience is that this process will often (but not always) catch unhealthy drives quickly enough, with sufficient evidence to demonstrate they were in ill health before shipping, to return them, avoiding most failures soon after purchase.

It's always worth monitoring drives and having comprehensive, regularly tested backups no matter what.

Even with all this, drive failures can still happen rapidly with little or no warning.

Testing will not guarantee drive health, but the right types of tests can identify a large proportion of unhealthy drives that would normally go unnoticed until catastrophic failure.
