Home Lab Project: Storage
Previously I've written about Networking for my Home Lab system, but another important aspect is storage. It matters because storage very rapidly becomes a bottleneck, often long before CPU and memory for Lab usage. There are quite a few different approaches to this and I'm not going to claim mine is any better or worse than anyone else's because everyone is just shooting for different stuff.

Requirements

In my case I don't need extreme IO performance, but I do need enough to smoothly run several VMs simultaneously or test Linux software RAID configuration changes with several drives, which is a pretty good way to bring any non-SSD system (apart from possibly large RAID10 arrays) to its knees with all the simultaneous reads/writes to different image files on the same physical storage. The other thing is to have sufficient space for leaving projects with TBs of space usage idle for a while. Fast, or at least good concurrency handling, combined with a lot of space without spending a lot of money is generally quite a challenge.

Linux mdadm & lvm-cache

The approach I've ended up with is to use a mobo with lots of IO and simply run Linux mdadm to manage RAID storage on standard desktop drives - a combination of large mechanical drives and an array of small SSDs for caching.

Bulk Storage

This is a RAID6 array of currently 6x 3TB 7200RPM drives with default 512k chunks. I'm likely to add an extra drive (maybe also a spare) still, which should take me up to ~15TB of space in the array. This gives decent read performance, obviously limited write performance, but most importantly robust reliability. Tuning of the stripe / chunk sizes may also yield better performance, but this is going to vary depending on the workloads so an experienced guess is about as good as it's going to get with a generic Lab system. I might take it up to 1024k chunks later, but for now the defaults do the job.

My experience is that larger drives seem to have shorter lives than their smaller counterparts... as if failure rates are similar per TB, but that's small samples and it could just as easily be changes in manufacturing economics or other factors over time. With rebuild times of about a day, and the high stress a rebuild puts on an array, I'm not going to risk the N+1 redundancy of RAID5: I need to be able to tolerate another failure during a rebuild. This also matters because the non-critical (experimental) data and large storage mean I've decided it's not worth or practical backing up images, but at the same time I don't want to run excessive risk of wasted time rebuilding images if a complete failure occurs.

For cost saving I've used second-hand drives off eBay. If you are patient it roughly halves the cost. If you are impatient you're likely to save nothing and waste a lot of your time. While some may think this is excessive risk, I do complete read and write testing on all drives (new or used), and so far this has effectively weeded out the infant-death new drives and the people trying to palm off failing drives on eBay (yeah... hundreds of errors in the SMART logs two weeks before listing... busted!). As a matter of course I avoid drives that are mentioned as used in a SAN (that probably means they've been running 24x7 and maybe had constant writes from someone running Bittorrent or similar non-stop).
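I haven't spelled out the test commands above, but the kind of full read/write test I'm describing looks something like this sketch (the device name /dev/sdX is a placeholder, and the badblocks write pass destroys everything on the drive):

# smartctl -t long /dev/sdX
# smartctl -a /dev/sdX
# badblocks -wsv -b 4096 /dev/sdX
# smartctl -A /dev/sdX

The long SMART self-test plus a look at the error logs catches drives that are already logging problems, the destructive badblocks pass writes and verifies test patterns across the whole surface, and the final SMART attribute check picks up any reallocated or pending sectors the exercise has shaken loose.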
I also judge how much the seller knows technically. A non-technical seller will generally be decent about it if the drive turns out to be a dud, and a very technical seller will gracefully accept a solid technical analysis. The worst ones in my experience are those who think they know a lot but don't really; they will put up crazy fights and waste loads of time with arguments like "I could format it therefore it's 100% working" and similar. They are also unable to understand technical arguments and often assume they can blind buyers with their nonsensical tech jargon... hence they tend to be the ones selling dodgy drives and thinking they can get away with it. In arrays where I've deliberately used 50/50 new and used drives to experiment, I've seen roughly even failure rates, so I've decided the saving is worth it. If you don't test, though, I think you will see the dodgy sellers pushing up the failure rates on used drives.

Cache

This is a RAID5 array of 3x 120GB SSDs. It's not a lot of space, but when working only a very small amount of storage is "hot" at any time, so it's adequate for my needs, though if you want to spend more there is probably speed to be gained.

Booting

With the large drives I'm using GPT and with that I'm using UEFI boot. While the /boot volume is happy to work off RAID (in my case RAID1 since it's tiny), the /boot/efi partition is another story since the BIOS has to be able to read it and it's not going to understand mdadm RAID. One work-around would be to have multiple partitions (on different drives) which are kept up to date with rsync or similar. In my case I've used an mdadm RAID1 with version 1.0 metadata, which is stored at the end of the device. That means each device making up the RAID is also readable by the BIOS on its own. The only thing to watch is that you still have to set partition type EF00 for the BIOS to boot off this, but mdadm still happily assembles the array. You then need to configure each individual partition in the /boot/efi RAID1 to be tried for booting by the BIOS:

efibootmgr -c -d /dev/sda -l '\EFI\debian\grubx64.efi' -L 'Linux 0'

efibootmgr assumes the first partition if you don't tell it otherwise, which is generally going to be the case here.

Putting it together

I don't use caching for the host OS since it's not really working hard - the VMs are. Fortunately that's easy to manage with LVM, with one VG that spans both the RAID6 and the SSD RAID5. For the VM image volume I do this by creating cache and metadata (small) volumes, then converting them to a cache-pool (a fuller sketch of the command sequence is below):

# lvcreate -L SIZEOFCACHE -n virtimages_cache vg00 /dev/md3

The sizing is important. Basically, as much as practical for the cache, leaving of course enough for metadata. Docs say to use 1/1000 of the cache size (unless very small) for metadata, so that is fairly easy to work out. The other important thing here is the choice of write-through or write-back caching. The latter might be the default but check your docs. In my case I have RAID5 and am happy to risk a cache failure with write-through as it's quite a small chance. Depending on your requirements you might want to set things up differently. Increased complexity generally means increased risk and there's always a chance of ending up losing the cache and with it the filesystem integrity. Then apply the cache-pool to your main VM image volume:

# lvconvert --type cache --cachepool vg00/virtimages_cache vg00/virtimages

At this stage caching should be active. Don't expect instant results - it will take time to learn and adapt caching to the hot-spots.
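For completeness, the whole sequence boils down to something like this sketch rather than exact commands - the sizes are placeholders (metadata at roughly 1/1000 of the cache size), the metadata LV name is made up for the example, and the write-through choice is set when building the cache-pool:

# lvcreate -L 100G -n virtimages_cache vg00 /dev/md3
# lvcreate -L 100M -n virtimages_cache_meta vg00 /dev/md3
# lvconvert --type cache-pool --cachemode writethrough --poolmetadata vg00/virtimages_cache_meta vg00/virtimages_cache
# lvconvert --type cache --cachepool vg00/virtimages_cache vg00/virtimages

Undoing it later is just a case of removing the cache-pool LV (see Cache Removal below), eg. lvremove vg00/virtimages_cache, and LVM flushes and detaches the cache from the origin volume.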
Cache Removal

The basic concept for removal of the cache is simple - just remove the cache-pool LV. LVM (assuming you have a sufficiently mature version... some earlier versions are buggy) should take care of everything for you neatly.

Cache Boot Failure

One problem I encountered was rebooting. If you are using LVM cache then with Debian (and likely derivatives) you MUST install the "Suggested" package thin-provisioning-tools, else the cached volume will not work on boot, and that might also mean you need to do recovery work to boot without this volume initially.

Performance

I've not seen anything dramatic, but after a short period of "learning" I'm seeing roughly a 2:1 ratio of IO going to the cache with general Lab VM usage. Most importantly, things are running acceptably smoothly with multiple VMs building simultaneously, and slow-down is uniform without stalling as the storage gets loaded up. A VM with a 4-vDisk RAID10 (all .qcow2 files on a cached volume) can build the array at ~200MB/s... that's actually significantly faster than I've seen on bare metal for a similar mechanical drive configuration, which is especially impressive considering all the image files are on the same storage.

A more controlled test with iozone shows that on the final random IO test (after warm-up tests) we get a respectable result:

# iozone -T -t 4 -s 256m -r 4k -I -O -w -i0 -i1 -i2

This runs 4 threads working on 4k blocks with DIRECTIO to avoid memory caching skewing the results (ie. we get raw storage speed). An initial cycle yields this for the Random IO test:

Children see throughput for 4 random readers = 2040.34 ops/sec

Then a second cycle (remember we leave the files in place with "-w") so we hopefully start to see the cache kicking in, which it seemed to be doing in the last test above:

Children see throughput for 4 random readers = 23849.38 ops/sec

That's very respectable and not unlike figures I've seen pre-production load testing an expensive Enterprise SAN full of 15k SAS drives with SSD cache, just for a fraction of the cost (admittedly the Enterprise SAN is going to be a whole lot more reliable, probably also better behaved under load, and overall a much more sensible choice for critical workloads!). The combination of cost and performance certainly does the trick for what I need in a Home Lab.

SAN/NAS

Now that I've got the storage I can also use volumes for NFS or iSCSI for SAN-like experiments (eg. sharing a raw volume between cluster nodes). I'm using IET (iSCSI Enterprise Target) which so far has worked well. Over the network the 1Gbit Ethernet is definitely the bottleneck, so maybe sometime I'll look at faster options when I start adding additional Hypervisors.
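As a rough illustration of the iSCSI side, exporting an LV via IET is just a matter of an entry in ietd.conf - the IQN and the LV name here are made-up examples:

Target iqn.2016-01.lab.example:vg00.shared
        Lun 0 Path=/dev/vg00/shared,Type=blockio

Then restart the target daemon, and the initiators (eg. open-iscsi on the Hypervisors) discover and log in to the target and see the raw volume as a plain SCSI disk.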
Comments:
Glen,
I'm looking at using lvmcache as well on my host NAS/compile/KVM box and I'm wondering how you've found the performance to be over time? And how has the stability been as well? I'm using Debian Jessie with linux kernel v4.2.6 currently, and just pulled the trigger to buy some new disks and SSDs for the system since it was getting old and disks were starting to fail.
I'm hoping this will work well... but we'll see!
I've not seen any problems yet. Older SSDs did tend to have problems as they wear: they have to wear-level more aggressively, or operations fail and get retried. But I've not seen any such problems with the Samsung 840 EVO drives I've used (and they're fairly old now). Something similar happens as magnetic drives start to fail, but I monitor drives closely and all the drives are healthy.
Since this is a Lab system it sees brief periods of intensive usage (multiple VMs thrashing), so it's not like many other applications. I'm also planning to upgrade the SSDs to larger ones to make sure the hot-spots fit and to further improve practical performance.
Well, after a whole bunch of problems with linux kernel v4.4-rc? I'm finally up and running and the cache looks to be pretty good so far. Pulled five 3.5" disks, added two 4TB ones and the SSDs. Much quieter system, less heat load and it's faster for sure.
Now to work on monitoring and trying to check performance.
Thanks for your blog, it was a big help.
John
P.S. I need to file a bug report on the man page for lvmcache, it doesn't explicitly say up front that both the source volume and the cache LVs need to be in the same VG.