Glen Pitt-Pladdy :: Blog

Linux RAID (mdadm) and rebuild tuning

Every now and then it's necessary to replace a disk in an array, and rebuilding a large RAID5/6 array was looking like it was going to take over 10 hours, running at around 50MB/s.

With even cheap desktop SATA drives these days you should be able to achieve comfortably over 100MB/s so improvement was clearly possible.

Many of these techniques will also apply to tuning for linear IO (e.g. streaming a large video file), but in my case I've got a pattern of small random IO, so more conservative values are suitable for normal running.

The rebuild speed can be seen with:

$ cat /proc/mdstat
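
To pull out just the speed figure rather than reading the whole file, something like the following can help. This is only a minimal sketch: the function name md_sync_speed is made up for illustration, and it assumes the usual speed=NNNNK/sec field that appears on the progress line while a resync or recovery is running.

```shell
# Print the current resync/recovery speed (roughly in MB/s) for each array
# that is rebuilding. Reads /proc/mdstat unless another file is given.
md_sync_speed() {
    sed -n 's/.*speed=\([0-9][0-9]*\)K\/sec.*/\1/p' "${1:-/proc/mdstat}" |
    while read -r kb; do
        # the kernel reports K/sec in units of 1024 bytes
        echo "$((kb / 1024))MB/s"
    done
}
```

Calling md_sync_speed with no arguments reads the live /proc/mdstat; it prints nothing when no array is rebuilding.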

Speed Limits

There are adjustable limits on rebuild speed which can be checked with:

$ cat /sys/block/mdX/md/sync_speed_max
2000000 (local)

The value is in KB/s per device, so 2000000 is a limit of roughly 2GB/s (the kernel default is 200000, i.e. 200MB/s). Either is more than enough for an array of SATA drives - chances are few will be able to achieve that sort of speed anyway.
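
The number in this file is in KB/s, followed by "(local)" or "(system)" depending on whether it was set for this array or inherited from the system-wide default. As a rough sketch (the helper name md_limit_mb is hypothetical, and dividing by 1000 is only an approximation), this prints the limit in MB/s so it is easier to compare against the speeds shown in /proc/mdstat:

```shell
# Print an md speed limit in (approximate) MB/s.
# $1 is the sysfs file, whose content looks like "200000 (system)".
md_limit_mb() {
    local kb rest
    read -r kb rest < "$1"
    echo "$((kb / 1000))"
}

# On a live system (mdX is a placeholder, as elsewhere in this article):
#   md_limit_mb /sys/block/mdX/md/sync_speed_max
# Raising the per-array limit, as root:
#   echo 500000 > /sys/block/mdX/md/sync_speed_max
```

The 500000 above is just an example value, not a recommendation.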

Unnecessary IO

Linux will vary the speed of rebuild to give priority to regular IO over the rebuild. This lowers the impact of the rebuild on normal running of the system, but it also means that if you can cut down on IO it will increase the rebuild speed.

To identify possible causes of IO I ran:

$ iostat -Nx 10

After a while the volumes that were getting IO could be seen - as expected /var/ was the main area, and a few unnecessary processes that were polling files or updating things could be stopped.

This did mean that the speed remained a bit above 50MB/s for longer, but never reached over 60MB/s.

Tuning Stripe Cache

During a rebuild all the remaining drives have to be read in order to reconstruct the data for the new one. For this the caching of stripes is a critical factor, and if lots of small reads have to be made rather than large reads then performance will suffer. To avoid this a large Stripe Cache helps enormously. To find out whether you are maxing out the Stripe Cache, first look at the existing size:

$ cat /sys/block/mdX/md/stripe_cache_size
256

That's tiny, so it's likely to be completely used during a rebuild, as can be seen:

$ cat /sys/block/mdX/md/stripe_cache_active
256

So increase the size (double it) and repeat until you see that the cache is no longer being maxed out and the speed stops increasing. I ended up seeing no significant benefit above 16384 so stopped at:

# echo 16384 >/sys/block/mdX/md/stripe_cache_size

Roughly:

  • 4096 produced 90MB/s
  • 8192 produced 105MB/s (only full part of the time)
  • 16384 produced 120MB/s (rarely full)

Increasing this further doesn't seem to make any difference in my case, which is understandable when we're not fully using 16384 most of the time.
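
The "double it while it is full" approach above could be scripted along these lines. This is a sketch rather than what I actually ran: the function name grow_stripe_cache and the 16384 ceiling are assumptions for illustration, and a single reading of stripe_cache_active is noisy, so in practice you would want to sample it a few times before deciding.

```shell
# Double stripe_cache_size once if the cache is currently full.
# $1 = the array's md sysfs directory, e.g. /sys/block/mdX/md
# $2 = ceiling to stop at (16384 if not given)
# Prints the resulting stripe_cache_size.
grow_stripe_cache() {
    local dir=$1 max=${2:-16384} size active
    size=$(cat "$dir/stripe_cache_size")
    active=$(cat "$dir/stripe_cache_active")
    if [ "$active" -ge "$size" ] && [ "$((size * 2))" -le "$max" ]; then
        echo "$((size * 2))" > "$dir/stripe_cache_size"
    fi
    cat "$dir/stripe_cache_size"
}
```

Run as root every so often during the rebuild (for example from a loop with a sleep in it) while watching the speed in /proc/mdstat.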

Tuning this is something you might like to do during normal use anyway since many usage patterns will benefit from higher than the default 256.
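
One thing to weigh up before leaving a large value in place is memory: the kernel documentation gives the cost as page size x number of disks x stripe_cache_size. A small sketch of that arithmetic (the function name stripe_cache_mib is made up, and the 4096 byte page size is an assumption that holds on typical x86 systems):

```shell
# Approximate memory used by the stripe cache, in MiB.
# $1 = stripe_cache_size, $2 = number of disks in the array,
# $3 = page size in bytes (4096 if not given; see `getconf PAGESIZE`)
stripe_cache_mib() {
    echo "$(( $1 * $2 * ${3:-4096} / 1024 / 1024 ))"
}

stripe_cache_mib 16384 7   # 7 disks at 16384 -> 448
stripe_cache_mib 256 7     # 7 disks at the default 256 -> 7
```

So for a 7 disk array, 16384 costs around 448MiB of RAM against about 7MiB for the default.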

NCQ

Native Command Queuing is something that people have reported causing problems. To test this I've tried disabling it on all the devices in the array:

# echo 1 >/sys/block/sdX/device/queue_depth

I'm not seeing any obvious difference due to this but there may be 1-2% which gets lost in the noise.
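
To apply that to every drive in one go, and to keep a note of the old values so they can be put back, a sketch like this may be handy. The function name set_queue_depth and the drive names in the example are placeholders, not taken from my system.

```shell
# Set the queue depth on a list of drives. A depth of 1 effectively turns
# NCQ off. The previous value is printed first so it can be restored later.
# $1 = sysfs block directory (normally /sys/block), $2 = depth, rest = drives
set_queue_depth() {
    local root=$1 depth=$2 dev f
    shift 2
    for dev in "$@"; do
        f="$root/$dev/device/queue_depth"
        echo "$dev was $(cat "$f")"
        echo "$depth" > "$f"
    done
}

# As root, on a live system (drive names are examples only):
#   set_queue_depth /sys/block 1 sda sdb sdc
```

To re-enable NCQ, run it again with whatever depth was printed for each drive.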

Read Ahead

This is simply anticipating that more data will be read and reading it ahead of time, which ought to be very relevant to rebuilding an array. Unfortunately my tests went counter to that expectation: the default of 256 on each drive (not the md device) gave the fastest rebuild, and raising it in my case seemed to slow things down.
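
For anyone wanting to repeat the test, read ahead can be inspected and changed per drive with blockdev. Note that blockdev works in 512 byte sectors while the sysfs file read_ahead_kb works in KB, so the same setting shows up as two different numbers. The sketch below assumes the 256 above is in sectors as reported by blockdev; the helper name ra_sectors_to_kb is made up for illustration.

```shell
# Convert a read-ahead value in 512-byte sectors (as used by blockdev)
# to KB (as used by /sys/block/sdX/queue/read_ahead_kb).
ra_sectors_to_kb() {
    echo "$(( $1 * 512 / 1024 ))"
}

# On a live system (sdX is a placeholder for each member drive):
#   blockdev --getra /dev/sdX        # show read-ahead in sectors
#   blockdev --setra 256 /dev/sdX    # set it to 256 sectors, as root
```

With that assumption, 256 sectors corresponds to 128 in read_ahead_kb.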

Other Stuff

The Linux RAID performance page has some other things around tuning which may be useful, but in my testing the above is what has worked for me.

Another thought is that at this speed one core seems to be maxed out, so maybe the CPU is the real limit here.

I also notice that one drive is being read about 15% faster than the rest - I'm guessing this is resulting in some kind of bottleneck, but I don't know why this would be.

That said, the overall data transfer rate is about 840MB/s, which is impressive, and will result in an approx 5 hour rebuild across 7x 3TB drives. Not bad!
