Fortunato wails, “For the love of God, Montresor!” Montresor replies, “Yes, for the love of God!”
– Edgar Allan Poe, “The Cask of Amontillado”
To say that I didn’t really want to write this blog post would be a massive understatement. I expect the slings and arrows of every disk and SAN vendor as they tell me that their solution doesn’t adhere to what I’m about to share. I’m not perfect, and I’ll welcome constructive and meaningful comments where I’ve either missed an important point or made an error. However, to the disk and SAN vendors who want to say these things don’t apply to them, I respond, “Physics are physics.” This post is designed to help anyone understand the key points of hard drive performance. I’ll skip some non-performance features of hard drives in favor of a common sense approach to talking about how drives perform and how, at an enterprise level, you can think about getting the right performance for your disk subsystems.
Round and Round We Go (How Disks Work, Rotational Latency, and Track-to-Track Seek Time)
So there are two key pieces to a hard drive when we’re talking performance: first the platters, and second the disk arm. Platters are flat, round, disc-shaped, and magnetically sensitive. They spin on a central spindle. Most drives have several of these but it’s possible to have only one. The information on the hard drive is recorded on these platters in concentric rings called tracks. Each ring is broken into a set of sectors, generally holding 512 bytes of information each. In modern hard drives, the number of sectors varies per track.
Sidebar: Most newer hard drives are moving to 4096 byte sectors instead of 512 byte sectors. However, this is largely not important because operating systems haven’t caught up yet, so the drives are emulating 512 byte sectors. For the moment, I’d encourage you to ignore the fact that sector sizes are changing.
The disk arm has on it a set of one or more read/write heads. Each read/write head reads or writes information on the platter it’s positioned over. The whole arm swings between the different tracks (concentric rings), and once it’s in position the read/write head waits for the right part of the platter to rotate underneath it so that it can do its work. This results in two kinds of latency (delay) when trying to read information. The first is the track-to-track seek time. Given you’re at track 1 and you need to reposition to track 3, how long will that take? This is measured in ms. The range here is from 3ms to about 15ms to travel 1/3rd of the distance from the first to last track. For the most part seek time is rarely discussed any longer, though you’ll (sometimes) see it on technical specifications if you look.
The second kind of latency is rotational, and the primary way that drives are sold on a difference in latency is the RPM of the disk platters. It makes sense that a drive that’s spinning at 15,000 RPM takes about half the time to get to the same sector on the track as a drive that’s spinning at 7,200 RPM. The average rotational latency for a 15K drive is 2ms. For a 7,200 RPM drive it’s 4.17ms. So you can buy a drive that has a random access time (going to a random spot on the disc to read a sector) anywhere from around 5ms to about 20ms (a 4x difference).
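Those rotational latency numbers fall straight out of the RPM. On average the sector you want is half a revolution away, so a quick Python sketch (the function name is mine, not anything from a drive spec) looks like this:

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for a sector: half of one full revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm  # 60,000 ms in a minute
    return ms_per_revolution / 2.0

for rpm in (7_200, 10_000, 15_000):
    print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.2f} ms average rotational latency")
```

Add the seek time from the spec sheet on top of this and you have a rough random access time for the drive.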
Sidebar: A funny thing to think about is that the amount of data passing under the read/write head on the outer tracks is actually greater than the amount of data passing by on the inner tracks. This is due to constant angular velocity (i.e. RPM) and the relatively larger number of sectors on the outer tracks. So technically you get greater throughput at the outer edge of the drive than near the center.
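To put rough numbers on that sidebar, here’s a minimal Python sketch. The sectors-per-track counts are made up for illustration (real drives vary by model and by zone), and the function name is mine.

```python
SECTOR_BYTES = 512  # sector size used throughout this post

def track_throughput_bytes_per_sec(sectors_per_track: int, rpm: float) -> float:
    """Bytes passing under the head per second on one track at a constant RPM."""
    revolutions_per_sec = rpm / 60.0
    return sectors_per_track * SECTOR_BYTES * revolutions_per_sec

# Hypothetical sector counts: an outer track holding twice what an inner track holds.
outer = track_throughput_bytes_per_sec(sectors_per_track=1000, rpm=7200)
inner = track_throughput_bytes_per_sec(sectors_per_track=500, rpm=7200)
print(f"outer: {outer / 1e6:.1f} MB/s, inner: {inner / 1e6:.1f} MB/s, ratio: {outer / inner:.1f}x")
```

Same RPM, twice the sectors per revolution, twice the data per second.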
The net of all of this is that when you’re looking for performance look for rotational speed and seek time.
Nearly every hard drive sold today has some level of cache on it. These caches are nearly exclusively read caches, so that’s the way we’ll talk about them. Caching on the drive does something different than caching on the computer itself. Often drives do what’s known as read-ahead caching – that is, they read the next few sectors after the one that you asked for because it’s quick for them to do that. There’s literally no latency in seeking the disk arm and negligible rotational latency. The expectation is that the computer will ask for multiple sectors in a row – even if those requests aren’t at the drive yet. Similarly, some drives pre-cache, picking up sectors before the one requested and storing them in the cache as well. The idea is that reads tend to happen in clusters, so the more “free” data the drive can pick up now, the less likely it is to have to go back for it later.
Fragmentation (Sequential vs. Random Access)
Before I leave disk access times it’s important to explain what fragmentation is – and why it’s important to overall disk performance. Sequential access of information is relatively quick. The time to swing the drive arm (if necessary) is small, and in most cases the rotational latency is negligible. For two sectors on the same track the time to read may be less than 1ms – and because of read-ahead it’s possible that it’s already cached on the drive so the result may be returned in nanoseconds. So all things being equal accessing information sequentially is much faster.
When you start to access information randomly you’re back in the 5ms to 20ms camp. So it’s five to twenty times slower than sequential access. This is why I call hard drives a pseudo-random access device. They’re not truly good at random access; they’re good at sequential access and reasonably good at fast switching. Knowing the performance implications, you want every access to the drive to be sequential rather than random.
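To see what those access times mean in practice, here’s a back-of-the-envelope sketch in Python. It assumes one request at a time with no caching or command queuing, so treat the results as a rough ceiling rather than a benchmark. The function name is mine.

```python
def random_ops_per_second(access_time_ms: float) -> float:
    """Upper bound on random operations per second if each one pays the full access time."""
    return 1000.0 / access_time_ms

for access_ms in (5.0, 10.0, 20.0):
    print(f"{access_ms:>4.0f} ms per random access -> about {random_ops_per_second(access_ms):.0f} operations/second")
```

That’s a few dozen to a couple hundred random operations per second per drive, which is why turning random access into sequential access matters so much.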
There’s a certain amount of necessary random access. You can’t get everything to line up sequentially – particularly given the idea of multiple applications running on the computer. However, one thing you can do is allocate files on the file system in a contiguous way. If every individual file that you need is allocated contiguously, reading each file results in one seek and then a continuous sequential read. That’s good from a performance perspective.
However, files are often not allocated contiguously. This happens particularly when files are initially created at a smaller size and are expanded over time. Each time the file is extended it’s possible that the new space allocation on the disk won’t be contiguous with the previous allocation. This is particularly true when there are multiple IO operations happening at the same time and when the file grows over time.
The solution is to use a tool to reorganize files so that each file is one contiguous allocation. There are numerous programs that try to do this by moving files from one spot to another. They work – but only in situations where the files, once contiguous, are read sequentially. There’s little impact on applications like databases where accesses are inherently random.
State of the Art (Solid State Drives)
The ultimate answer to performance, as most folks know, is solid state drives (SSDs). These drives aren’t mechanical at all. They’re electronics. They’re based on flash memory – the same sort of approach as we see in USB memory sticks, compact flash, SD cards, etc. Just like these media, SSDs come in varying speeds, but in the context of this conversation we’re looking at them as high performance devices. They can cost as much as 125 times as much as a SATA drive for the same storage. However, where a SATA drive might return 150 IOps, an SSD may return over 8000 IOps.
A large amount of this difference is that there’s a very low switch time to read from one sector of the SSD to another sector. (Really they’re banks but let’s keep calling them sectors for consistency.) Instead of ms, bank switch times are nanoseconds.
Sidebar: Ultimately, SSD is the fastest storage you’re going to get, but in addition to the cost issue, there’s one other issue you need to be aware of. SSDs wear out when you write too much data to them. Each bank/cell is designed for a certain number of write operations. Beyond that a write may fail. The drives handle some of this internally, reallocating space into some extra scratch space and remapping things – however, ultimately there’s a maximum number of writes that an SSD can be used for. So unlike a hard drive that has a stable mean time between failures (MTBF) given environmental conditions (power and temperature), an SSD’s longevity is based in part on how write-focused the operations are.
Sometimes we talk about the throughput of a device and that can be important with SSD drives since their random access time is so low. High speed SSDs are able to absolutely saturate the bus they’re connected to. However, for the most part the throughput of different hard drives isn’t that different so except for large sustained sequential read or write operations the throughput of a hard drive isn’t a key factor.
Get on the Bus (SATA or SAS)
There are obvious limitations of the transfer bus that have to be considered. The first is the maximum throughput of the bus – 3Gb/s or 6Gb/s or beyond. However, even the interface of the bus makes a difference. SATA has its roots in the old ATA hard drive technology (effectively swapping parallel communication for serial) and while it has added an important command queuing capability, the reigning king for performance is SAS, which grew up from the SCSI standards. There are numerous technical capabilities of SAS that outstrip SATA, but from a performance perspective a SATA drive with the same specifications as a SAS drive should perform similarly.
Redundant Array of Inexpensive Disks (RAID)
Redundant Array of Inexpensive Disks (RAID) was a great idea when it was first conceived. There were several levels that were initially defined but really we’re left with a few key RAID levels today:
- RAID 0 – Striping. No redundancy but allows multiple drives to be treated as one.
- RAID 1 – Mirroring. Complete 1:1 redundancy. Every bit of data is on two drives at the same time.
- RAID 10 – Mirroring and Striping. The benefits of mirroring while striping across multiple drives.
- RAID 5 – Checksum. A mathematical checksum is created across the drives (minus one) and written to the remaining drive. Technically the checksum is spread out across drives so that the checksum isn’t all on one drive.
- RAID 6/DP – Checksum times 2. Same as RAID5 except there are two checksums written. Handles two drives in the stripe failing at the same time.
Snake Oil Note: Some folks talk about RAID 0+1 and RAID 1+0 being different. It’s a technical matter that makes no difference. For me they’re just RAID 10.
So let’s look at what we’re looking at for performance and storage for RAID 10, RAID 5, and RAID 6.
|Measure||RAID 10||RAID 5||RAID 6|
|Storage Capacity with 4 drives||2 drives||3 drives||2 drives|
|Storage Capacity with 8 drives||4 drives (number of drives/2)||7 drives (number of drives – 1)||6 drives (number of drives – 2)|
|Performance Random Write with 4 drives||2 drives||2 drives||1.33 drives|
|Performance Random Read with 4 drives||4 drives||3 drives||2 drives|
|Performance Random Write with 8 drives||4 drives (number of drives/2)||4 drives (two writes/operation)||2.66 drives (three writes/operation)|
|Performance Random Read with 8 drives||8 drives (number of drives)||7 drives (number of drives – 1)||6 drives (number of drives – 2)|
There’s a key caveat to the preceding table: I’m talking about small random writes. RAID 5 will perform at n-1 for writes when they’re large and sequential. I’m also assuming there aren’t any communications or processing issues that interfere with raw disk performance. One other thing that you need to know – and the thing that can have a substantial impact on performance – is the size of the stripe.
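The table boils down to a few simple formulas. Here’s a sketch in Python that reproduces it under the same assumptions: small random writes, two writes per operation for RAID 5, three for RAID 6, and no controller or bus bottlenecks. The function and key names are mine, and the results are in “drive equivalents,” not any real unit of throughput.

```python
def raid_estimates(level: str, drives: int) -> dict:
    """Usable capacity and random read/write performance, in 'drive equivalents'."""
    if level == "RAID 10":
        # Half the drives hold mirror copies; reads can come from any drive.
        return {"capacity": drives / 2, "random_read": drives, "random_write": drives / 2}
    if level == "RAID 5":
        # One drive's worth of checksum; each small write costs two writes.
        return {"capacity": drives - 1, "random_read": drives - 1, "random_write": drives / 2}
    if level == "RAID 6":
        # Two drives' worth of checksum; each small write costs three writes.
        return {"capacity": drives - 2, "random_read": drives - 2, "random_write": drives / 3}
    raise ValueError(f"unsupported RAID level: {level}")

for n in (4, 8):
    for level in ("RAID 10", "RAID 5", "RAID 6"):
        est = raid_estimates(level, n)
        print(f"{level} with {n} drives: capacity={est['capacity']:g}, "
              f"read={est['random_read']:g}, write={est['random_write']:.2f}")
```

Plug in your own drive count to see how quickly RAID 6 gives up random write performance in exchange for surviving a second drive failure.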
Taking Your Stripes (RAID Striping)
RAID 10 can operate at a sector level – that is, only writing a sector that’s changed to two drives – although many RAID 10 implementations follow the same model as RAID 5 and RAID 6. However, RAID 5 and RAID 6 have to work from the perspective of a stripe of data. That is, they collect a set of sectors across the stripe and calculate the checksums based on that stripe. A stripe can be created as small as 2K (4 sectors) but a more typical setting is 64K (128 sectors). The checksums are calculated on this stripe of data and therefore anytime one of the sectors in that stripe changes the entire checksum must be rewritten. With this in mind, let’s look at our example above. To get to 128 sectors of information in a 4 drive RAID 5 we’re looking at 43 sectors per drive (128 divided across 3 data drives, rounded up) and a 43 sector checksum. So to change a single bit we have to write 43 sectors of information on the checksum drive. So in that way a write is 44 times less efficient (44 total sector writes vs. 1 for a non-protected drive). However, it isn’t really that bad. The primary reason for this is that the writes to the checksum drive are sequential. There’s normally no seek time (assuming the sectors are all in the same track) and there’s little rotational latency because the sectors are sequential.
Ways to minimize this are to reduce the stripe size – which increases waste and increases the overhead of calculating the checksum on the controller – but depending on how the drives are being used this may be a good idea. This is particularly true as you start to consider how the drives are being used. Really this impact is seen with write operations – for read operations it’s possible for the controller to just read the disk with the information on it.
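Here’s the stripe arithmetic as a small Python sketch so you can try other stripe sizes. It follows the simplified model I used above (one changed data sector plus a full rewrite of the checksum portion of the stripe) and only covers RAID 5. The function names are mine.

```python
import math

SECTOR_BYTES = 512

def checksum_sectors_per_stripe(stripe_bytes: int, drives: int) -> int:
    """Sectors per drive (and so per checksum) for a RAID 5 stripe, rounded up."""
    data_drives = drives - 1  # RAID 5: one drive's worth of checksum
    stripe_sectors = stripe_bytes // SECTOR_BYTES
    return math.ceil(stripe_sectors / data_drives)

def sector_writes_for_one_change(stripe_bytes: int, drives: int) -> int:
    """One data sector plus the whole checksum portion of the stripe."""
    return 1 + checksum_sectors_per_stripe(stripe_bytes, drives)

for stripe_kb in (2, 16, 64):
    stripe_bytes = stripe_kb * 1024
    print(f"{stripe_kb:>2}K stripe on 4 drives: "
          f"{checksum_sectors_per_stripe(stripe_bytes, 4)} checksum sectors, "
          f"{sector_writes_for_one_change(stripe_bytes, 4)} sector writes per change")
```

Shrinking the stripe shrinks the number of checksum sectors rewritten per change, which is the trade-off described above.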
Putting Walls Up (Master Boot Record and Partition Tables)
Thus far we’ve been talking about drives from a raw perspective – but that isn’t the way that they’re really accessed by an operating system. There are partitions indicating which parts of a drive – or array of drives – are addressed and how. Operating systems write a partition table to the first few sectors on the drive to indicate what sectors are used for what. In fact, there’s also space for what’s called a master boot record – it’s what the computer reads from the drive first and it’s what kick starts the booting process. Both of these take up some space at the beginning of the drive. This is important because it will offset the start of a partition relative to the start of a stripe on the RAID array. As we’ll see in a second, misalignment between stripes and partition boundaries can cause performance issues.
The Right Format (Partition Formatting)
Inside of the partitions created by the partition tables is a format of the disk. Disk formats allow for the organization of files, allocation of space, and so on. FAT – File Allocation Table, NTFS – New Technology File System, EXT3, etc., are all different file systems that are based fundamentally on two key principles: an allocation map and a folder structure. The allocation map simply indicates what sectors are available and which are not. The allocation map is used when files are added to find a spot on the drive which isn’t already consumed. The allocation map is also cleared when a file is deleted. The folder structure starts with a root folder and from there other folders are connected in a hierarchy. The key challenge as it comes to performance is the way the allocation map is handled.
On the one hand you want to be able to use every spare bit on the drive, on the other hand you can’t track the availability of every bit without having another bit for allocation – that’s pretty inefficient. Disks have their own natural sector boundaries at 512 bytes. So one could easily register a bit per sector to indicate whether it’s available or not. However, that would create a very large table for the allocation map – one which would exceed a four byte – and even an 8 byte integer quickly. This makes indexing into and checking the allocation map difficult and inefficient. So in order to manage performance of the allocation map, allocation is handled in clusters. A cluster is a grouping of sectors which will be allocated and deallocated in a block. A typical cluster size for a moderate disk might be 4K. Thus every allocation is for 8 sectors.
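To get a feel for the size difference, here’s a rough Python sketch. It assumes the allocation map is a simple bitmap with one bit per allocation unit, which is a simplification of what real file systems do, and it uses a hypothetical 1 TB drive. The function names are mine.

```python
SECTOR_BYTES = 512

def sectors_per_cluster(cluster_bytes: int) -> int:
    """How many 512 byte sectors make up one cluster."""
    return cluster_bytes // SECTOR_BYTES

def allocation_bitmap_bytes(disk_bytes: int, unit_bytes: int) -> int:
    """Size of a one-bit-per-unit allocation map, rounded up to whole bytes."""
    units = -(-disk_bytes // unit_bytes)  # ceiling division
    return -(-units // 8)

DISK_BYTES = 10**12  # hypothetical 1 TB drive
per_sector = allocation_bitmap_bytes(DISK_BYTES, SECTOR_BYTES)
per_cluster = allocation_bitmap_bytes(DISK_BYTES, 4096)

print(f"sectors per 4K cluster: {sectors_per_cluster(4096)}")
print(f"bitmap tracking every sector: {per_sector:,} bytes")
print(f"bitmap tracking 4K clusters:  {per_cluster:,} bytes")
```

Tracking clusters instead of sectors cuts the map to roughly an eighth of the size, at the cost of wasting part of the last cluster of every file.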
So let’s say that our stripe size for our RAID array is 4K – but the alignment of our clusters and our underlying drives isn’t the same. It’s possible for this misalignment to result in a practical performance decrease of 30%. Newer Windows operating systems address this during the partitioning process by aligning the partition boundaries on natural stripe boundaries to resolve the performance issues.
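Here’s a small Python sketch of what misalignment looks like. The two partition offsets are assumptions I picked for illustration: a partition starting 63 sectors in (a layout commonly seen on older systems) and one starting at 1 MiB. Check your own partition table for the real offset. The function names are mine.

```python
SECTOR_BYTES = 512

def is_aligned(partition_offset_bytes: int, stripe_bytes: int) -> bool:
    """True when the partition starts exactly on a stripe boundary."""
    return partition_offset_bytes % stripe_bytes == 0

def stripes_touched(start_byte: int, length_bytes: int, stripe_bytes: int) -> int:
    """How many stripes a single read or write of length_bytes has to touch."""
    first = start_byte // stripe_bytes
    last = (start_byte + length_bytes - 1) // stripe_bytes
    return last - first + 1

STRIPE = 4096   # 4K stripe, as in the example above
CLUSTER = 4096  # 4K cluster

# Hypothetical partition offsets; the first cluster starts at the partition offset.
for label, offset in (("63-sector offset", 63 * SECTOR_BYTES), ("1 MiB offset", 1024 * 1024)):
    print(f"{label}: aligned={is_aligned(offset, STRIPE)}, "
          f"stripes touched by first cluster={stripes_touched(offset, CLUSTER, STRIPE)}")
```

In the misaligned case every cluster straddles two stripes, so a single cluster write turns into work on both of them.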
SAN or Direct Attached Storage
From a performance perspective, when you’re comparing a Storage Area Network (SAN) and Direct Attached Storage (DAS), DAS is faster. This isn’t just me saying this – take a look at “Which is Faster: SAN or Directly-Attached Storage”. However, I need to say that the reason for doing a SAN isn’t performance. There are many recoverability (i.e. clustering), scalability, and management reasons why you might use a SAN – however, SANs simply aren’t faster than DAS. (The disclaimer at the top of this article is mostly aimed at this paragraph.) SAN vendors will tell you about their backplanes, switches, controllers, and caching. It’s possible to get faster performance if you change the equation – but this shouldn’t be why you’re looking at a SAN. Applying the same techniques locally on a disk controller will be faster than a SAN.
The only thing a SAN can do is slow down your access to disks because of increased latency, or saturation of a switch, connection, or controller. That is to say that you have more bandwidth available to locally connected disks than to a SAN. If you get to high throughput, you’ll potentially saturate the connection between the SAN and the local system.
I’ll stop here because this is about understanding disk performance not a flame war with SAN vendors.
Fundamentally there are two ways to measure disk performance. The first is to look at the number of Input/Output (Read/Write) operations per second. Known as IOps this is a measurement of how the disks are capable of performing. The other perspective is to measure response time or latency between the request and when it’s serviced. We saw latency in the disks above and we’ll come back to that in a moment, but for now let’s talk about IOps.
Shooting at a Moving Target – IOps
So I can tell you how many IOps a drive array supports – but there’s no way for me to tell you whether that’s sufficient for your application or not – until it’s in production. The problem is that IOps changes day-to-day, hour by hour, and minute by minute. When you have everyone in the organization coming in at the same moment and opening their email, your mail server needs a large number of IOps to respond to the users. However, at 3 AM when all that’s going on is SPAM filtering, there isn’t much of a need for IOps.
There are some things you can say in a general sense. Databases require the highest number of IOps. Mail servers need less but still more than file servers. However, even in this there’s a great deal of variation. Some databases are “really hot” – requiring a high number of IOps – and some are “cold”, requiring basically no IOps.
So what does that mean? If you’re comparing different configurations, or you’re getting bragging rights with your friends, measure your IOps. If not, they’re not likely to help much.
Are you Ready Now? (Latency)
The real measure of disk performance is: “What is it doing when you’re using whatever system it was intended for?” The measure for this is latency – or the average time it takes to get a response from the disk for a read or write operation. The shorter the time it takes to respond, the better the performance. The longer the time it takes to respond, the worse the performance. It’s really pretty simple. An average between 10ms and 20ms is pretty good. Above that and you may have a problem.
There are some exceptions to this. If you’re doing a disk-to-disk backup your backup disk may see times higher than this – but then again your target disk is really a bottleneck. So you’ve got to use some sanity with these thresholds and not just blindly say that your backup disk having an average response time of 200ms is a major issue.
Sometimes folks will talk to me about counters like the % disk busy that Windows will return as a counter – and my response is that this doesn’t matter. I don’t care whether the disk is busy or not – I care whether the disk is returning results quickly.
Hopefully this gave you enough of a background to understand how disks work and how you can think about creating high-performance disk subsystems.