Glen Pitt-Pladdy :: Blog
Linux RAID (mdadm) and rebuild tuning
Every now and then it's necessary to replace a disk in an array, and in my case rebuilding a large RAID5/6 array was looking like it would take over 10 hours, running at around 50MB/s.
With even cheap desktop SATA drives these days you should be able to achieve comfortably over 100MB/s, so improvement was clearly possible.
Many of these techniques will also apply to tuning for linear IO (e.g. streaming a large video file), but in my case I've got a pattern of small random IO, so more conservative values are suitable for normal running.
The rebuild speed can be seen with:
$ cat /proc/mdstat
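During a rebuild, /proc/mdstat includes a progress line ending in a speed figure. A small sketch for pulling out just that figure; the sample mdstat line below is illustrative, and on a live system you would pipe /proc/mdstat through the same function:

```shell
# Extract the current rebuild speed from an mdstat-style progress line.
mdstat_speed() {
    sed -n 's/.*speed=\([0-9]*K\/sec\).*/\1/p'
}
# Illustrative input; on a real system: mdstat_speed < /proc/mdstat
printf '%s\n' '[=>...] recovery = 12.6% (37043392/292945152) finish=59.5min speed=104857K/sec' | mdstat_speed
# prints 104857K/sec
```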
There are adjustable limits on rebuild speed which can be checked with:
$ cat /sys/block/mdX/md/sync_speed_max
That defaults to 200000 (KB/s), i.e. a limit of 200MB/s, which should be enough for an array of SATA drives - chances are few will be able to achieve that sort of speed anyway.
Linux will vary the speed of rebuild to give priority to regular IO over the rebuild. This lowers the impact of the rebuild on normal running of the system, but it also means that if you can cut down on IO it will increase the rebuild speed.
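If you would rather the rebuild not yield so much, there is a matching sync_speed_min knob (also in KB/s) that sets a floor under the rebuild rate. A minimal sketch, assuming array md0 and a 50MB/s floor; the MD_BASE variable is just a hook for rehearsing against a scratch directory, not part of the real interface:

```shell
# Set the minimum rebuild speed for an array, in KB/s.
# MD_BASE defaults to the real sysfs tree; point it at a scratch
# directory first if you want a dry run.
MD_BASE="${MD_BASE:-/sys/block}"
set_sync_min() {
    echo "$2" > "$MD_BASE/$1/md/sync_speed_min"
}
# Usage (as root): insist on at least ~50MB/s for md0
#   set_sync_min md0 50000
```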
To identify possible causes of IO I ran
$ iostat -Nx 10
After a while the volumes receiving IO could be seen - as expected /var/ was the main area, and a few unnecessary processes polling files / updating things could be stopped.
This did mean that the speed stayed a bit above 50MB/s for longer, but it never got past 60MB/s.
Tuning Stripe Cache
During a rebuild all drives have to be read in order to rebuild the array. For this the caching of stripes is a critical factor and if lots of small reads have to be made rather than large reads then performance will suffer. To avoid this a large Stripe Cache helps enormously. To understand if you are maxing the Stripe Cache look at the existing size:
$ cat /sys/block/mdX/md/stripe_cache_size
The default of 256 is tiny, so it's likely you will use all of it during a rebuild, as can be seen from:
$ cat /sys/block/mdX/md/stripe_cache_active
So increase the size (doubling each time) and repeat until the cache is no longer being maxed out and the speed stops improving. I ended up seeing no significant benefit above 8192, so stopped at:
# echo 16384 >/sys/block/mdX/md/stripe_cache_size
Increasing this further doesn't seem to make any difference in my case, which is understandable when we're not fully using 16384 most of the time.
Tuning this is something you might like to do during normal use anyway since many usage patterns will benefit from higher than the default 256.
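One reason not to raise it indefinitely: the stripe cache costs real memory, roughly stripe_cache_size entries times the page size times the number of member devices. A quick back-of-envelope check for the figures in this post (16384 entries, 4KB pages, 7 drives - substitute your own array's numbers):

```shell
# Approximate stripe cache memory: entries * page size (KB) * devices, in MB.
entries=16384; page_kb=4; devices=7
echo "$(( entries * page_kb * devices / 1024 )) MB"
# prints 448 MB
```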
Native Command Queuing
Native Command Queuing (NCQ) is something that people have reported as causing problems. To test this I tried disabling it on all the devices in the array:
# echo 1 >/sys/block/sdX/device/queue_depth
I'm not seeing any obvious difference due to this but there may be 1-2% which gets lost in the noise.
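To apply that to every member at once, a small helper can loop over the drives. A sketch; the drive list in the usage comment is an assumption (substitute your actual members), and SYSBLOCK is only a hook for rehearsing against a fake tree:

```shell
# Write a queue depth to one drive's queue_depth attribute.
# SYSBLOCK defaults to the real sysfs tree; override it for a dry run.
SYSBLOCK="${SYSBLOCK:-/sys/block}"
set_queue_depth() {
    echo "$2" > "$SYSBLOCK/$1/device/queue_depth"
}
# Usage (as root) - depth 1 effectively disables NCQ:
#   for d in sda sdb sdc sdd sde sdf sdg; do set_queue_depth "$d" 1; done
```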
Read Ahead
Read-ahead simply anticipates that more data will be read and fetches it ahead of time, which ought to be very relevant to rebuilding an array. Unfortunately, my tests ran counter to expectations: the default of 256 on each drive (not the md device) gave the fastest rebuild, and raising it in my case seemed to slow things down.
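The per-device read-ahead can be inspected and set with blockdev; the units are 512-byte sectors, so the default of 256 corresponds to 128KB. sdX here is a placeholder for your own drives, as elsewhere in this post:

```shell
# Show the current read-ahead (in 512-byte sectors) for a member drive:
blockdev --getra /dev/sdX
# Put it back to the default 256 sectors (128KB) if experiments made it worse:
blockdev --setra 256 /dev/sdX
```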
The Linux RAID performance page has some other tuning suggestions which may be useful, but in my testing the above is what has worked for me.
Another thought is that at this speed one core seems to be maxed out, so maybe that's the bottleneck here - perhaps this really is the limit of the CPU?
I also notice that one drive is being read about 15% faster than the rest - I'm guessing this results in some kind of bottleneck, but I don't know why it happens.
That said, the overall data transfer rate is about 840MB/s, which is impressive, and works out to an approximately 5-hour rebuild across 7x 3TB drives. Not bad!
Copyright Glen Pitt-Pladdy 2008-2017