Glen Pitt-Pladdy :: Blog

Home Lab Project: Storage

Previously I've written about Networking for my Home Lab system, but another important aspect is storage. It matters because in Lab usage storage very rapidly becomes the bottleneck, often long before CPU or memory.

There are quite a few different approaches to this and I'm not going to claim this is any better or worse than anyone else's because everyone is just shooting for different stuff.

Requirements

In my case I don't need extreme IO performance, but I do need enough to smoothly run several VMs simultaneously, or to test Linux software RAID configuration changes with several drives. The latter is a pretty good way to bring any non-SSD system (apart from possibly large RAID10 arrays) to its knees, with all the simultaneous reads/writes to different image files on the same physical storage.

The other thing is to have sufficient space for leaving projects with TBs of space usage idle for a while.

Fast, or at least good concurrency handling combined with a lot of space without spending a lot of money is generally quite a challenge.

Linux mdadm & lvm-cache

The approach I've ended up with is to use a mobo with lots of IO and simply run Linux mdadm to manage RAID storage on standard desktop drives: a combination of large mechanical drives for bulk storage and an array of small SSDs for caching.

Bulk Storage

This is a RAID6 array of currently 6x 3TB 7200RPM drives with the default 512k chunks. I'm still likely to add an extra drive (and maybe a spare), which should take the array up to ~15TB of space. This gives decent read performance, obviously limited write performance, but most importantly robust reliability. Tuning the stripe / chunk sizes may also yield better performance, but this is going to vary with the workload, so an experienced guess is about as good as it's going to get with a generic Lab system. I might take it up to 1024k chunks later, but for now the defaults do the job.
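
As a rough sketch of the capacity arithmetic: RAID6 spends two drives' worth of space on parity, so usable space is (drives - 2) x drive size. The helper name raid6_usable_tb and the device names in the commented mdadm line are placeholders for illustration, not taken from the actual system.

```shell
# RAID6 keeps two drives' worth of parity, so usable space is (drives - 2) * size.
# raid6_usable_tb is a hypothetical helper name used only for this illustration.
raid6_usable_tb() {
    drives=$1
    size_tb=$2
    if [ "$drives" -lt 4 ]; then
        echo "RAID6 needs at least 4 drives" >&2
        return 1
    fi
    echo $(( (drives - 2) * size_tb ))
}

raid6_usable_tb 6 3   # current array, 6x 3TB: prints 12
raid6_usable_tb 7 3   # after adding a seventh drive: prints 15

# Creating such an array would look roughly like this (device names are assumptions):
# mdadm --create /dev/md2 --level=6 --raid-devices=6 --chunk=512 /dev/sd[a-f]3
```

These are nominal (decimal) TB, so the filesystem will report somewhat less.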

My experience is that larger drives seem to have shorter lives than their smaller counterparts... as if failure rates are similar per TB, but that's small samples and it could just as easily be changes in manufacturing economics or other factors over time. With rebuild times of about a day with the high stress of rebuilding an array, I'm not going to risk the N+1 redundancy of RAID5. I need to be able to tolerate another failure during rebuild. This is also important because the non-critical (experimental) data and large storage means I've decided that it's not worth/practical backing up images, but at the same time I don't want to run excessive risk of wasted time rebuilding images if a complete failure occurs.

For cost saving I've used second-hand drives off eBay. If you are patient it roughly halves the cost. If you are impatient you're likely to save nothing and waste a lot of your time. While some may think this is excessive risk, I do complete read and write testing on all drives (new or used), and so far this has effectively weeded out the infant-death new drives and the people trying to palm off failing drives on eBay (yeah... hundreds of errors in the SMART logs two weeks before listing... busted!)

As a matter of course I avoid drives that are mentioned as used in a SAN (that probably means they've been running 24x7, and maybe had constant writes from someone running BitTorrent or similar non-stop), and I also judge how much the seller knows technically. A non-technical seller will generally be decent about it if the drive turns out to be a dud, and a very technical seller will gracefully accept a solid technical analysis. The worst ones in my experience are those who think they know a lot but don't really: they will put up crazy fights and waste loads of time with arguments like "I could format it therefore it's 100% working" and similar. They are also unable to understand technical arguments and often assume they can blind buyers with their nonsensical tech jargon... hence they tend to be the ones selling dodgy drives and thinking they can get away with it.
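
The post doesn't say which tools were used for the read and write testing, so the following is only one plausible way to do it, assuming smartmontools and badblocks (from e2fsprogs) are installed. The helper burn_in_plan is hypothetical and deliberately only prints the commands, because the badblocks write test wipes the drive.

```shell
# Dry-run sketch of a destructive burn-in for a drive that holds no data yet.
# burn_in_plan is a hypothetical helper; it only prints commands so they can be reviewed first.
burn_in_plan() {
    dev=$1
    echo "smartctl -a $dev"                # inspect error logs and power-on hours before testing
    echo "badblocks -b 4096 -wsv $dev"     # write+read pattern test; DESTROYS all data on the drive
    echo "smartctl -t long $dev"           # start the drive's extended self-test
    echo "smartctl -a $dev"                # re-check attributes once the self-test has finished
}

burn_in_plan /dev/sdX
```

Comparing the reallocated and pending sector counts from the first and last smartctl runs is what shows whether the drive degraded under load.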

In arrays where I've deliberately used 50/50 new and used drives to experiment, I've seen roughly even failure rates so I've decided the saving is worth it. If you don't test though I think you will see the dodgy sellers pushing up the failure rates on used drives.

Cache

This is a RAID5 array of 3x 120GB SSDs. It's not a lot of space, but when working only a very small amount of storage is "hot" at any time so it's adequate for my needs, though if you want to spend more there is probably speed to be gained.

Booting

With the large drives I'm using GPT and with that I'm using UEFI boot. While the /boot volume is happy to work off RAID (in my case RAID1 since it's tiny), the /boot/efi partition is another story since the BIOS has to be able to read it and that's not going to understand mdadm RAID.

One work-around would be to have multiple partitions (on different drives) which are kept up to date with rsync or similar. In my case I've used an mdadm RAID1 with version 1.0 metadata, which is stored at the end of the device. That means that each device making up the RAID is also readable by the BIOS on its own. The only thing to watch is that you still have to set partition type EF00 for the BIOS to boot off this, but mdadm still happily assembles the array.
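
A minimal sketch of building such an array, assuming three disks whose first partition is reserved for /boot/efi. The names /dev/md0 and /dev/sd[abc]1 and the helper esp_raid_plan are assumptions for illustration; the function only prints the commands rather than running them.

```shell
# Dry-run sketch: RAID1 for /boot/efi with the md superblock at the end of each member.
# esp_raid_plan is a hypothetical helper; it prints commands instead of executing them.
esp_raid_plan() {
    md=$1
    shift
    count=$#
    for part in "$@"; do
        disk=${part%1}                   # assumes the ESP is partition 1 on each disk
        echo "sgdisk -t 1:EF00 $disk"    # mark partition 1 as an EFI System Partition
    done
    echo "mdadm --create $md --level=1 --raid-devices=$count --metadata=1.0 $*"
    echo "mkfs.vfat -F 32 $md"           # format via the md device so all members match
}

esp_raid_plan /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
```

Anything that writes to one member directly rather than through the md device will leave the mirrors out of sync, so changes should go through the mounted array.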

You then need to configure each individual partition in the /boot/efi RAID1 to be tried for booting by the BIOS:

efibootmgr -c -d /dev/sda -l '\EFI\debian\grubx64.efi' -L 'Linux 0'
efibootmgr -c -d /dev/sdb -l '\EFI\debian\grubx64.efi' -L 'Linux 1'
efibootmgr -c -d /dev/sdc -l '\EFI\debian\grubx64.efi' -L 'Linux 2'

efibootmgr assumes the first partition if you don't specify one with -p, which is generally going to be correct.

Putting it together

I don't use caching for the host OS since it's not really working hard - the VMs are. Fortunately that's easy to manage with LVM with one VG that spans both RAID6 and the SSD RAID5.
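
A sketch of how one VG over both arrays might be set up, with the VM image volume pinned to the mechanical array. /dev/md3 is the SSD array named in the commands that follow; /dev/md2 for the RAID6 and the 2T size are assumptions, and vg_plan is a hypothetical helper that only prints the commands.

```shell
# Dry-run sketch: one VG over both arrays, with the origin LV allocated on the RAID6 PV only.
# vg_plan is a hypothetical helper; /dev/md2 and the 2T size are assumptions.
vg_plan() {
    vg=$1
    bulk=$2
    ssd=$3
    echo "pvcreate $bulk $ssd"
    echo "vgcreate $vg $bulk $ssd"
    # Naming a PV at the end of lvcreate restricts allocation to that device
    echo "lvcreate -L 2T -n virtimages $vg $bulk"
}

vg_plan vg00 /dev/md2 /dev/md3
```

Host OS volumes can be created the same way on the RAID6 PV and simply never have a cache-pool attached.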

For the VM image volume I do this by creating cache and metadata (small) volumes, then converting them to a cache-pool:

# lvcreate -L SIZEOFCACHE -n virtimages_cache vg00 /dev/md3
# lvcreate -L SIZEOFMETA -n virtimages_cache_meta vg00 /dev/md3
# lvconvert --type cache-pool --poolmetadata vg00/virtimages_cache_meta vg00/virtimages_cache --cachemode writethrough

The sizing is important. Basically, give the cache as much as is practical while leaving enough for metadata. The docs say to use 1/1000 of the cache size for metadata (unless the cache is very small), so that is fairly easy to work out.
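
The 1/1000 rule can be worked out with a few lines of shell. This is a sketch only: cache_meta_mib is a hypothetical helper, the 200GiB cache size is made up, and the 8MiB floor is the minimum as I recall it from lvmcache(7), so check your own man page.

```shell
# Suggest a metadata LV size in MiB: 1/1000 of the cache size, rounded up, never below 8MiB.
# cache_meta_mib is a hypothetical helper name; the 8MiB minimum is taken from lvmcache(7).
cache_meta_mib() {
    cache_mib=$1
    meta=$(( (cache_mib + 999) / 1000 ))   # integer division, rounded up
    if [ "$meta" -lt 8 ]; then
        meta=8
    fi
    echo "$meta"
}

cache_meta_mib 204800   # a 200GiB cache: prints 205
```

The result would then be used as SIZEOFMETA, for example -L 205M.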

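The other important thing here is the choice of write-through or write-back caching. Current LVM defaults to write-through, but check your docs. With write-through every write also goes to the underlying RAID6 before it completes, so losing the cache should not lose data; write-back is faster for writes but leaves recent writes only on the cache until they are flushed. In my case the cache is on RAID5 and I've stuck with write-through (as in the command above), so I'm happy with the small chance of a cache failure. Depending on your requirements you might want to set up differently. Increased complexity generally means increased risk, and there's always a chance of ending up losing the cache and with it the filesystem integrity.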

Then to apply the cache-pool to your main VM image volume:

# lvconvert --type cache --cachepool vg00/virtimages_cache vg00/virtimages

At this stage caching should be active. Don't expect instant results - it will take time to learn and adapt caching to the hot-spots.

Cache Removal

The basic concept for removal of the cache is simple - just remove the cache-pool LV. LVM (assuming you have a sufficiently mature version... some earlier versions are buggy) should take care of everything for you neatly.
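
A sketch of what removal could look like with the names used earlier. Removing the pool directly, as described above, would be lvremove vg00/virtimages_cache; newer LVM releases also offer lvconvert --uncache on the cached LV, which is what this dry run prints. uncache_plan is a hypothetical helper and only echoes the commands.

```shell
# Dry-run sketch: print the commands that detach and delete the cache from a cached LV.
# uncache_plan is a hypothetical helper; it does not touch any volumes itself.
uncache_plan() {
    vg=$1
    origin=$2
    echo "lvs -a $vg"                        # before: the cache-pool and its hidden sub-LVs are listed
    echo "lvconvert --uncache $vg/$origin"   # flush dirty blocks, then remove the cache-pool
    echo "lvs -a $vg"                        # after: only the plain origin LV should remain
}

uncache_plan vg00 virtimages
```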

Cache Boot Failure

One problem I encountered was rebooting. If you are using LVM cache then with Debian (and likely derivatives) you MUST install the "Suggested" package thin-provisioning-tools else the cached volume will not work on boot and that might also mean you need to do recovery stuff to boot without this volume initially.
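
A quick way to check for this before rebooting is to look for cache_check, one of the binaries shipped in thin-provisioning-tools. This is a sketch: check_cache_tools is a hypothetical helper, and the update-initramfs step is my assumption about how the tool ends up available at early boot (the package may already trigger it).

```shell
# Check that cache_check (from thin-provisioning-tools) is on the PATH; print the fix if not.
# check_cache_tools is a hypothetical helper; the initramfs rebuild is an assumption.
check_cache_tools() {
    if command -v cache_check >/dev/null 2>&1; then
        echo "cache_check found"
    else
        echo "apt-get install thin-provisioning-tools"
        echo "update-initramfs -u"
    fi
}

check_cache_tools
```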

Performance

I've not seen anything dramatic but after a short period of "learning" I'm seeing roughly a 2:1 ratio of IO to the cache with general Lab VM usage. Most important things are running acceptably smoothly with multiple VMs building simultaneously. Slow-down is uniform without stalling as the storage gets loaded up.
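
If the 2:1 figure is read as cache hits to misses, it can be derived from the counters that lvs exposes for a cached LV. A sketch, with made-up counter values and a hypothetical helper name:

```shell
# Ratio of cache hits to misses, using counters that lvs can report for a cached LV, e.g.:
#   lvs -o name,cache_read_hits,cache_read_misses,cache_write_hits,cache_write_misses vg00/virtimages
# hit_miss_ratio is a hypothetical helper; the numbers passed below are invented.
hit_miss_ratio() {
    awk -v h="$1" -v m="$2" 'BEGIN { if (m == 0) print "inf"; else printf "%.2f\n", h / m }'
}

hit_miss_ratio 2000 1000   # prints 2.00
```

The counters are cumulative, so sampling them before and after a workload gives a better picture than the totals alone.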

A VM with a 4-vDisk RAID10 (all .qcow2 files on a cached volume) can build the array at ~200MB/s... that's actually significantly faster than I've seen on bare metal for a similar mechanical drive configuration, which is especially impressive considering all image files are on the same storage.

A more controlled test with iozone shows that on the final random IO test (after warm-up tests) we get a respectable result:

# iozone -T -t 4 -s 256m -r 4k -I -O -w -i0 -i1 -i2

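This runs 4 threads (-t 4) doing 4k record IO (-r 4k) on a 256MB file each, with DIRECTIO (-I) to avoid memory caching skewing the results (ie. we get raw storage speed). Results are reported in operations per second (-O), and the tests are write (-i0), read (-i1) and random read/write (-i2).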

This yields for an initial cycle for the Random IO test:

        Children see throughput for 4 random readers    =    2040.34 ops/sec
        Parent sees throughput for 4 random readers     =    2040.29 ops/sec
        Min throughput per thread                       =     440.79 ops/sec
        Max throughput per thread                       =     549.81 ops/sec
        Avg throughput per thread                       =     510.09 ops/sec
        Min xfer                                        =   52541.00 ops

        Children see throughput for 4 random writers    =    7899.51 ops/sec
        Parent sees throughput for 4 random writers     =    6522.77 ops/sec
        Min throughput per thread                       =    1611.94 ops/sec
        Max throughput per thread                       =    2138.51 ops/sec
        Avg throughput per thread                       =    1974.88 ops/sec
        Min xfer                                        =   49399.00 ops

Then a second cycle (remember "-w" leaves the files in place), so we hopefully start to see the cache kicking in, as it already seemed to be doing in the last test above:

        Children see throughput for 4 random readers    =   23849.38 ops/sec
        Parent sees throughput for 4 random readers     =   23288.01 ops/sec
        Min throughput per thread                       =    3079.76 ops/sec
        Max throughput per thread                       =    7050.25 ops/sec
        Avg throughput per thread                       =    5962.34 ops/sec
        Min xfer                                        =   29418.00 ops

        Children see throughput for 4 random writers    =    7966.93 ops/sec
        Parent sees throughput for 4 random writers     =    6498.04 ops/sec
        Min throughput per thread                       =    1801.63 ops/sec
        Max throughput per thread                       =    2281.55 ops/sec
        Avg throughput per thread                       =    1991.73 ops/sec
        Min xfer                                        =   51751.00 ops
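
To put the random read figures in more familiar units, the ops/sec values can be converted to MiB/s (each op is one 4k record) and the two cycles compared. A small sketch using the numbers above; the helper names are hypothetical.

```shell
# Convert ops/sec at a given record size (KiB) into MiB/s, and compare two results.
# ops_to_mibs and speedup are hypothetical helper names.
ops_to_mibs() { awk -v ops="$1" -v kib="$2" 'BEGIN { printf "%.1f\n", ops * kib / 1024 }'; }
speedup()     { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", b / a }'; }

ops_to_mibs 2040.34 4      # first cycle random read:  prints 8.0
ops_to_mibs 23849.38 4     # second cycle random read: prints 93.2
speedup 2040.34 23849.38   # second cycle vs first:    prints 11.7
```

So the second cycle's random reads are roughly 11-12x faster than the first, while random writes are essentially unchanged between the two cycles.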

That's very respectable and not unlike figures I've seen in pre-production load testing on an expensive Enterprise SAN full of 15k SAS drives with SSD Cache, just for a fraction of the cost (admittedly the Enterprise SAN is going to be a whole lot more reliable, probably also better behaved under load, and overall a much more sensible choice for critical workloads!)

The combination of cost and performance certainly does the trick for what I need in a Home Lab.

SAN/NAS

Now that I've got the storage I can also use volumes for NFS or iSCSI for SAN-like experiments (eg. sharing a raw volume between cluster nodes).

I'm using IET (iSCSI Enterprise Target) which so far has worked well. Over the network the 1Gbit Ethernet is definitely the bottleneck, so maybe sometime I'll look at a faster option when I start adding additional Hypervisors.

 

Comments:

17/11/2015 21:14 :: John

Glen,
I'm looking at using lvmcache as well on my host NAS/compile/KVM box and I'm wondering how you've found the performance to be over time?  And how has the stability been as well?  I'm using Debian Jessie with linux kernel v4.2.6 currently, and just pulled the trigger to buy some new disks and SSDs for the system since it was getting old and disks were starting to fail.

I'm hoping this will work well... but we'll see!

18/11/2015 19:50 :: Glen Pitt-Pladdy

I've not seen any problems yet. Older SSDs did tend to have problems as they wear and have to more aggressively wear-level or operations fail and get retried, but I've not seen any such problems with Samsung 840 EVO drives I've used (and they're fairly old now). Similar happens as magnetic drives start to fail, but I monitor drives closely and all the drives are healthy.

Since this is a Lab system it seems to get brief periods of intensive usage (multiple VMs thrashing) so not like many other applications. I'm also planning to upgrade the SSDs to larger ones to be safe to fit in the hot-spots and further improve practical performance.

09/12/2015 17:19 :: John

Well, after a whole bunch of problems with linux kernel v4.4-rc? I'm finally up and running and the cache looks to be pretty good so far.  Pulled five 3.5" disks, added two 4TB ones and the SSDs.  Much quieter system, less heat load and it's faster for sure.

Now to work on monitoring and trying to check performance.  

Thanks for your blog, it was a big help.

John

P.S.  I need to file a bug report on the man page for lvmcache, it doesn't explicitly say up front that all LVs need to be in the same VG for both the source volume and cache LVs.



