Welcome back to the ongoing series of blog posts on benchmarking storage pods! Today is another beautiful Thursday and we have some extra information for you.
In my first blog post in this series, I had just barely gotten my hands on a Storage Pod -- and I was out to set a baseline for its storage performance. I mentioned that our intention had been to use SSDs for really fast access, and bulk SATA drives for the massive amounts of storage. I may also have mentioned that the drives seemed unevenly balanced across the controllers. More details, of course, are in part I.
First of all, I have to send a big thank you to our vendor, who responded almost immediately with some quick tips, clearly showing that the Storage Pod crowd at 45drives.com is paying attention and wants you to get the loudest bang for your buck. Much appreciated, and nothing but kudos!
Now, admittedly, I don't know all that much about hardware -- plenty of people are far better qualified experts. I do appreciate the effects of (a bit of, pun intended) electrical interference on a 4-layer mainboard PCB under high-frequency throughput, though, as well as the accuracy of x86 CPU architectures vs., let's say, s390(x). That is to say, I'm completely out of my comfort zone when I ask for advice in a computer shop, because I have to dumb it down (a lot), without even expecting an answer that helps me.
That said, loud the bang will be, because 45 readily available SATA drives of 4 TB each give you a raw storage volume of 180 TB, regardless of what you use it for. Given the combined cost of the chassis and the individual drives, you will likely elect to put both performance and data redundancy in the hands of replicating and balancing entire pods, rather than using multiple 3U storage arrays from the usual vendors.
Naturally, though, you will still want a single Storage Pod to perform well. You will also want each Storage Pod to provide some level of redundancy in and of itself, so that you don't have to walk into the datacenter with a handful of replacement drives too often.
I talked about the initial synchronization of the two RAID-5 arrays I created taking a little while, at a sustained throughput of about ~75 - ~90 MBps. I'm fairly certain you already appreciate this, but I have to mention it anyway: this is an implied throughput per disk, rather than for the array as a whole. It is therefore not a baseline, but it does give us information. If each disk can do ~85 MBps, then 19 such disks should (theoretically) be able to cumulatively achieve a total throughput of roughly 1.6 GBps, right? Right, but also wrong. The I/O pattern of a RAID resync is completely different from the I/O pattern of regular use. Luckily though, each SATA backplane is capable of sustaining 5 GBps, so we're also not maxing out the backplane.
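If you want to watch a resync for yourself, the kernel reports progress and per-array speed through /proc/mdstat, and the md driver's speed limits can be read (and raised) through sysctl. A minimal sketch -- the array names will differ per system:

    # Watch resync progress and the speed the kernel reports:
    watch -n 10 cat /proc/mdstat

    # The md resync speed is bounded by these tunables (KB/s per device);
    # raising the minimum speeds up an initial sync at the cost of
    # foreground I/O:
    sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max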
This gives us two interesting paths to explore:
- Does "md126" (with 20 partipant, active drives) outperform "md128" (with 17 participant, active drives)?
- Does substituting the SATA drives for SSDs mean anything (on Highpoint Rocket 370 controllers)?
In this blog post, I will likely only get around to answering question #1 -- sorry to disappoint.
We first seek a baseline for the current situation -- remember: 20 disks in one array, 17 in the other, both RAID 5, and each with two hot spares (not included in those counts).
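To double-check that geometry before benchmarking, mdadm will report each array's member count and spares. A quick sketch -- the md device names follow what was described above, but are otherwise an assumption:

    # Report geometry, state and spares for each array:
    mdadm --detail /dev/md126
    mdadm --detail /dev/md128
    # Hot spares are listed with a "spare" state at the bottom of the
    # device table and do not count towards the active drives.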
Q: What shall be your benchmark?
The answer is simple: Bonnie++.
Q: How shall you run it?
This is a simple one too, but the answer is a little longer. One opts for a bunch of defaults without tweaking at first. This shall be your baseline.
In our particular scenario, running Red Hat Enterprise Linux 7 on these Storage Pods, the default filesystem is XFS. I come from an era in which this was not the default, however, and we want to compare "now" vs. "then" -- meaning we'll start out with EXT4.
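For completeness, this is roughly what creating each filesystem looks like; the device path is a placeholder for the Logical Volume described further below:

    # EXT4 for the "then" runs:
    mkfs.ext4 /dev/VG/LV

    # XFS for the "now" runs; -f overwrites the previous filesystem:
    mkfs.xfs -f /dev/VG/LV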
The choice of filesystem, I think, has logical implications for performance, even in a Bonnie++ scenario (with a default chunk size of just under 16 GB). I'm taking into account things like journalling, and there may also be points where one filesystem driver is inclined to hook into kernel-level storage interfaces slightly differently from another filesystem driver.
Hence the scenario is set: we want a genuine comparison of different filesystems, benchmarked with Bonnie++ on the two different RAID 5 arrays.
Q: Proper scientists formulate a hypothesis before they start running in circles, so what's yours?
The hypothesis is that md126 (20 disks) will outperform md128 (17 disks), and XFS will outperform EXT4 but not by as much as the md126 array will outperform md128.
I feel inclined to acknowledge the following assumptions with regard to this hypothesis, since you started going all scientist on me:
- XFS surely isn't equal to or more than 118% more efficient than EXT4 at the pattern we're about to throw at it, and
- whatever pattern you throw at it when benchmarking very rarely represents the actual patterns thrown at it in a production environment.
First Things First
- Reading the output from Bonnie++, and then interpreting what it means, is a skill in and of itself -- I know because I've learned the hard way. More on this later.
- We're continuously running Munin on the "localhost" -- with its default 5-minute interval, and a non-CGI HTML and graph strategy. This means that every 5 minutes, Munin is eating CPU and I/O (though not on the same disks), and therefore we repeat each Bonnie++ run 10 times, in order to get results that better represent actual as opposed to fictitious throughput.
- Bonnie++ is run with only a -d /path/to/mountpoint command-line, with /path/to/mountpoint being a given Logical Volume we use for the specific test. That is to say that each RAID array has been made a Physical Volume, each PV has been added to a Volume Group unique to that PV, and each test has a Logical Volume (of a set, constant size) in the appropriate VG -- see the sketch right after this list.
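A minimal sketch of that layout, from array to benchmark run; the VG/LV names, the LV size and the mountpoint are assumptions, not our actual configuration:

    # One PV per RAID array, one VG per PV, one LV per test:
    pvcreate /dev/md126
    vgcreate vg_md126 /dev/md126
    lvcreate -L 500G -n lv_test vg_md126

    # Create the filesystem (see above) and mount it:
    mkfs.ext4 /dev/vg_md126/lv_test
    mount /dev/vg_md126/lv_test /mnt/test

    # Run Bonnie++ with defaults; -u is required when running as root:
    bonnie++ -d /mnt/test -u root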
Recognizing and acknowledging the I/O pattern likely to be thrown at the Storage Pod helps in determining the type of benchmark you will want to run. After all, a virtualization guest's disk image -- regardless of the contents inside that disk -- establishes a pattern slightly different from an IMAP spool, or even a number of package build roots / Docker images.
To obtain this information, let's see what a guest's disk image tends to do. Internally to the guest, a disk is partitioned and filesystems are mounted. Data is read from and written to places in this filesystem, but underneath it all may be something like a qcow2 thin-provisioned image file. This basically means that the I/O pattern is random, yet -- for most tech running inside your VM -- at a block-stream level.
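To illustrate the thin-provisioning part: a qcow2 image claims its full virtual size up front, but only allocates blocks as the guest actually writes them. The path and size here are illustrative:

    # Create a thin-provisioned 100G guest image:
    qemu-img create -f qcow2 /var/lib/libvirt/images/guest0.qcow2 100G

    # "disk size" (actual allocation) starts far below "virtual size":
    qemu-img info /var/lib/libvirt/images/guest0.qcow2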
An RPM installation, however, extracts a cpio payload, and a build root tends to install many RPMs -- much like a yum update does, or the yum installs for a Docker container. This particular I/O pattern, with its many small files, tends to be supremely expensive on the disk / array controller -- which is why most build roots live on (the in-memory) tmpfs.
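If you want that same trick on your own build hosts, the build root directory can simply be backed by a tmpfs mount; the path and size here are assumptions:

    # Keep the many small cpio-extracted files in RAM instead of on the
    # array; nothing hits the disks until the build result is copied out:
    mount -t tmpfs -o size=16g tmpfs /var/lib/mock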
Long story short: whether a Storage Pod's individual controllers need to maintain mirrors of entire RAID arrays, whether the RAID arrays are themselves robust (to some level), what your expectations are of the individual disks, how many Storage Pods you plan on including in your environment, and the particular technology you're thinking of using to communicate with your Storage Pods -- software iSCSI (tgtd)? NFS? GlusterFS (over NFS? Replicated? Distributed? Both?)? -- all of it matters. This is subject to requirements engineering, backed by large amounts of expertise, experience, information, skill and proper judgement.
How About that 20-Disk vs. 17-Disk Array?
Well, luckily one part of the hypothesis turns out to be true: The "md126" array (20 disks) outperforms the "md128" array (17 disks). Or does it?
However, it only does so using a particular I/O pattern that Bonnie++ tests:
[Chart: Bonnie++ putc() calls per second (K/sec), md126 vs. md128, EXT4]
When you run Bonnie++, it defaults to writing 16 GB worth of putc() calls -- individual characters, in other words -- or roughly 16 billion calls in total, give or take a dozen. What Bonnie++ has reported here is an average of 10 individual runs, where md126 averages 894.4K putc() calls a second, and md128 more -- 898.1K per second, to be precise.
Let us back up, and see if these numbers somewhat represent what we find in real life:
Some ~900K calls per second, with ~16 billion calls in total, works out to 16,000,000,000 / 900,000 ≈ 18,000 seconds -- roughly five hours for the putc() phase, not minutes. Check, it does take approximately that long.
So md128 is the clear winner, with more putc() calls per second, and lower CPU usage. Huh?
Bonnie++ "efficient block write"
[Chart: block write throughput (K/sec), md126 vs. md128, EXT4]
Here, what Bonnie++ calls "efficient block writes" achieves a significantly higher rate on the md126 array than on the md128 array. However, when you factor in the amount of CPU usage involved, md128 again outperforms md126 in efficiency (at 15K per 1% vs. 14K per 1%).
What are "efficient block writes"? I'm more than happy to admit I cannot say for sure. Deriving from the context I imagine efficient block writes have to do with the I/O scheduler in the kernel and subsequent subsystems. This tells me I will want to perform the same tests using different kernel I/O schedulers for the set of individual block devices in each array. Noted.
Note, however, that these are EXT4-based tests. We still have the XFS tests to go:
[Chart: Bonnie++ putc() calls per second (K/sec), md126 vs. md128, XFS]
This brings us to the second part of the hypothesis -- XFS outperforming EXT4, but by no more than 117%. Well, as far as this benchmark goes, it is busted. Let's look at the "efficient block write" stats:
Bonnie++ "efficient block writes"
[Chart: block write throughput (K/sec), md126 vs. md128, XFS]
Not that much gain in throughput -- although XFS apparently is slightly faster than EXT4 -- but what a decrease in CPU usage: XFS seems much more efficient, by a rate of around 30%!
I hope you enjoy reading about Storage Pods and the novice-level approach I'm taking to try and comprehensively benchmark them, using exclusively Free and Open Source Software. That said, your feedback and ideas on things to also try are much appreciated! Please do not hesitate to call out to firstname.lastname@example.org, or hit me up on Twitter (@kanarip).
While it is not part of this series of blog posts about my attempts to get the loudest bang for my buck, we do appreciate collaboration. I've consistently collected the raw statistics and reports, and I'm fully aware that the aforementioned numbers do not mean anything without them. Please do not hesitate to contact Kolab Systems if you are interested in reviewing the raw data yourself.
I would also appreciate your feedback on how you think multiple Storage Pods would fit into your infrastructure. Are you considering NFS servers for a virtualization environment, perhaps replicated through DRBD, or would you take a Ceph/GlusterFS approach? How do you think the concept of shared storage would fit in with future technologies such as Docker/Atomic? Hit me up at email@example.com or on Twitter.