Performance monitoring and tuning are topics that most professionals know or care little about until performance becomes a problem. The subject doesn’t come up often enough to drive much interest in understanding how it works, how it fits together, and what to do about it. However, there is value in understanding the fundamentals of performance monitoring on Windows-based systems and in knowing what to do based on what you find.
Searching Out the Bottlenecks
The primary activity in performance monitoring is seeking to understand the bottlenecks in the system that are either already causing performance issues or have the potential to cause them in the future. Because we’re seeking out bottlenecks, we’re looking primarily for metrics and counters that tell us how much capacity has been used and how much remains.
Because of this the performance metrics that we’re going to gravitate to are those which are expressed as a percentage. The percent of disk time, percent of CPU time, and percent of network usage are good examples of the kinds of metrics that we’ll want to focus on when evaluating performance at a macro level. They are not, however, an exhaustive list of metrics. They are only the metrics that are easiest to understand and extract value from quickly.
Spikes and Sustained
Even with counters that report status on a percentage of available resources there are still challenges to face. The first challenge is determining when there’s a problem because of a sustained lack of available resources and when it’s a momentary blip on the radar.
The primary consideration in performance monitoring is the interval of time over which you can accept performance problems. What level of performance is acceptable and what is not? Is it important that the CPU have some availability every second? In most cases the answer to that question is no. However, the question becomes more difficult when you ask it over a one-minute interval. Most users tolerate the occasional slowdown that is over within a minute. Hours of performance problems, however, are a different story.
So when evaluating what is and isn’t a performance problem, consider how long your users would be willing to accept a slowdown, and then ignore or temper your response to momentary spikes in whatever counter you’re looking at. Momentary spikes are a normal occurrence and simply mean that the system is pouring all of its resources into fulfilling the requests that the users have made.
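The distinction between a spike and a sustained problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 90% threshold, the one-second sample interval, and the sample data are all hypothetical, and the tolerated duration is whatever your users would actually accept.

```python
from typing import Sequence


def longest_run_above(samples: Sequence[float], threshold: float) -> int:
    """Return the longest run of consecutive samples strictly above threshold."""
    longest = current = 0
    for value in samples:
        if value > threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def is_sustained(samples: Sequence[float], threshold: float,
                 sample_seconds: float, tolerated_seconds: float) -> bool:
    """True when the counter stayed above threshold longer than users tolerate."""
    return longest_run_above(samples, threshold) * sample_seconds > tolerated_seconds


# Hypothetical percentage-counter samples, one per second.
spike = [20, 35, 100, 100, 30, 25, 40, 100, 20, 15]
sustained = [95] * 90 + [40] * 10

print(is_sustained(spike, 90, 1, 60))      # momentary blips only
print(is_sustained(sustained, 90, 1, 60))  # 90 seconds above the threshold
```

Measuring the longest unbroken run, rather than the peak value, is what keeps a momentary spike from being treated as a problem.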
Objects, Counters, and Instances
Performance monitoring on a Windows system requires an understanding of the way that Windows breaks down counters. On a Windows system performance monitoring starts with an object. An object is a broad representation of something, such as memory. This broad topic groups a set of related counters. Each counter is an individual measure in that category. For the memory object, page faults/sec, pages/sec, and committed bytes are all examples of counters. Each counter may measure the object in a different way but all of them relate to the object to which the counter belongs.
For each counter there may be multiple instances. An instance is a copy of the counter for a different part of the system. For example, if a system has two processors, the counter for % processor time will have three instances: one for each processor and one for a total (or average) across the two processors. In most cases each instance needs to be viewed separately from the others to identify specific areas where problems may occur.
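Windows tools usually write the object, instance, and counter together as a single path of the form \Object(Instance)\Counter. That notation isn't described in the text above, so treat the exact format handled here as an assumption. The sketch below splits such a path into its three parts; the class name, the function name, and the instance names 0, 1, and _Total for a two-processor system are illustrative.

```python
import re
from typing import NamedTuple, Optional


class CounterPath(NamedTuple):
    obj: str                 # the object, e.g. "Memory"
    instance: Optional[str]  # None when the counter has no instances
    counter: str             # the individual measure, e.g. "Pages/sec"


# Assumed path shape: \Object(Instance)\Counter, with "(Instance)" optional.
_PATH = re.compile(
    r"^\\(?P<obj>[^\\(]+)(?:\((?P<instance>[^)]*)\))?\\(?P<counter>.+)$"
)


def parse_counter_path(path: str) -> CounterPath:
    """Split a counter path into object, instance, and counter."""
    match = _PATH.match(path)
    if match is None:
        raise ValueError(f"not a counter path: {path!r}")
    return CounterPath(match["obj"], match["instance"], match["counter"])


# The three instances of % processor time on a two-processor system.
for name in ("0", "1", "_Total"):
    print(parse_counter_path(rf"\Processor({name})\% Processor Time"))

# A counter on an object that has no instances.
print(parse_counter_path(r"\Memory\Pages/sec"))
```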
You’ll find that for most purposes there are only four areas of performance monitoring that you care about. They are: memory, disk, processor, and network. These are the key metrics because they are the core system components that are most likely to be the source of the bottlenecks.
One of the challenges in performance monitoring is the interdependence of these key subsystems. A bottleneck in one area can quickly become a bottleneck in another. Thus the order in which you evaluate the performance of these subsystems is important to reaching the right conclusion.
The first characteristic to evaluate is memory because it has the greatest potential to impact the other metrics. A memory shortage will, in fact, often show up as a disk performance problem. Often this disk problem becomes apparent before the memory issue is fully understood.
In today’s operating systems, when memory is exhausted the hard disk is used as a substitute. This is a great idea since hard drives are substantially larger than memory on a server. However, it has the disadvantage that hard drives are orders of magnitude slower than memory. As a result, what might be a relatively light load on memory will quickly tax a hard disk and bring both the disk and the system to their knees.
One way to mitigate this is to minimize or eliminate the virtual memory settings in Windows, which prevents Windows from using the hard drive as if it were memory. This setting can prevent a memory bottleneck from impacting the hard drives, but it raises the potential that programs running on the server will not be able to get the memory they need. This is generally an acceptable trade-off for making sure that you’re aware of the true root cause of an issue.
The memory counter to watch is the pages per second (pages/sec) counter. This counter tracks the number of times that data was requested from memory but actually had to be read from disk. This counter, above all others, helps to identify when the available memory doesn’t meet the demands of the system. A small number, say less than 100, is a normal consequence of a running system; sustained numbers larger than 100, however, may indicate a need to add more memory. If you’re seeing a situation where you need more memory, you cannot evaluate disk performance reliably, since the system will be using the disk to accommodate the shortage of memory.
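As a rough sketch, the memory check and its consequence for disk analysis might look like the following. The limit of 100 comes from the rule of thumb above. Treating "sustained" as the average over an observation window is an assumption made for brevity, and the function names and sample values are hypothetical.

```python
from statistics import mean
from typing import Sequence

PAGES_PER_SEC_LIMIT = 100  # rule-of-thumb figure from the text


def memory_is_short(pages_per_sec: Sequence[float],
                    limit: float = PAGES_PER_SEC_LIMIT) -> bool:
    """True when pages/sec averages above the limit across the window.

    Assumption: the window average stands in for "sustained", so a single
    spike among otherwise quiet samples does not trip the check.
    """
    return mean(pages_per_sec) > limit


def can_trust_disk_counters(pages_per_sec: Sequence[float]) -> bool:
    """Disk counters are only meaningful once memory has been ruled out."""
    return not memory_is_short(pages_per_sec)


# Hypothetical pages/sec samples taken over the same window.
quiet_with_spike = [20, 40, 350, 30, 10]
constantly_paging = [180, 240, 150, 300, 210]

print(can_trust_disk_counters(quiet_with_spike))   # memory looks fine
print(can_trust_disk_counters(constantly_paging))  # fix memory before reading disk counters
```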
The primary counter for monitoring disk time is the ‘% Disk Time’ counter. This counter represents the average number of pending disk requests to a disk for the interval, multiplied by 100 (to get a percentage). This calculation method leads to some confusion when the disk can accept multiple concurrent requests, as SCSI and Fibre Channel disks can. It is possible for the instances measuring these types of disks to have a % disk time above 100%.
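The calculation described above is simple enough to reproduce, which makes it easier to see why values over 100% appear. This is a sketch of the arithmetic as the text describes it, not of how Windows implements the counter; the function name and sample data are hypothetical.

```python
from typing import Sequence


def percent_disk_time(pending_request_samples: Sequence[float]) -> float:
    """Average number of pending requests over the interval, times 100."""
    average_pending = sum(pending_request_samples) / len(pending_request_samples)
    return average_pending * 100.0


# A disk that has one request pending half of the time.
print(percent_disk_time([0, 1, 0, 1]))

# A disk that accepts concurrent requests and holds two or three at once.
print(percent_disk_time([2, 3, 2, 3]))
```

The second case reports well over 100% even though nothing is wrong with the measurement: the disk simply had more than one request outstanding on average.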
One of the choices to be made when selecting disk counters is whether to select Logical disk counters or Physical disk counters. Logical disk counters measure the performance relative to the partition or logical volume rather than by the physical disks involved. In other words, Logical disk counters are based on drive letter mappings rather than on the disks involved. The physical disk option shows instances for each of the hard drives that the operating system sees. These may either be physical drives or in the case of RAID controllers and SAN devices, the logical representation of the physical drive.
In general, the best approach to measuring disk performance is to use physical disk counters. This will allow you to see which hard disk drives are busier than others. Of course, if there’s a one-to-one relationship between your logical drives (partitions) and the physical drives (that the operating system sees) then either logical or physical disk counters are fine. However, only the physical disk counters are turned on by default. If you decide to use logical disk counters, you’ll need to run the DISKPERF command to enable them and then reboot the system.
The % Disk Time counter should be evaluated from the perspective of how long a performance slowdown you can tolerate. In most cases, disk performance is the easiest bottleneck to fix: add drives. So it’s an easy target if you’re seeing sustained % disk times above 100%. If you’re on a RAID array or a SAN, consider evaluating % disk time against a ceiling of 100% times the number of effective drives in the array rather than against 100%. For RAID 1 and RAID 1+0, the number of effective drives is one half the number of disks. For RAID 5, it’s the number of disks minus one.
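The effective-drive rules above translate directly into a small calculation. This sketch covers only the RAID levels the text mentions; the function names and the string labels for the RAID levels are hypothetical.

```python
def effective_drives(raid_level: str, disk_count: int) -> float:
    """Number of effective drives, using the rules of thumb from the text."""
    if raid_level in ("1", "1+0"):
        return disk_count / 2  # half of the disks hold mirror copies
    if raid_level == "5":
        return disk_count - 1  # one disk's worth of capacity holds parity
    raise ValueError(f"no rule of thumb given for RAID {raid_level}")


def disk_time_ceiling(raid_level: str, disk_count: int) -> float:
    """The % disk time above which the array should be considered saturated."""
    return 100.0 * effective_drives(raid_level, disk_count)


def is_disk_saturated(sustained_percent_disk_time: float,
                      raid_level: str, disk_count: int) -> bool:
    return sustained_percent_disk_time > disk_time_ceiling(raid_level, disk_count)


print(disk_time_ceiling("5", 6))    # six-disk RAID 5
print(disk_time_ceiling("1+0", 8))  # eight-disk RAID 1+0
print(is_disk_saturated(450.0, "5", 6))
print(is_disk_saturated(450.0, "1+0", 8))
```

The same sustained reading of 450% is within the ceiling for the six-disk RAID 5 array (500%) but over it for the eight-disk RAID 1+0 array (400%).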
Since the dawn of computing, people have been watching processing time and the processes which are consuming it. We’ve all seen the performance graphs created by Task Manager and watched in amazement the jagged mountain range that they trace. Despite the emphasis on processor time for overall performance, it’s one of the last indicators to review for performance bottlenecks, because it’s rarely the core problem facing computers today. In some scientific applications and others with intense processing requirements it may truly be the bottleneck; however, everyone seems to know which applications those are. For most applications processor speed just isn’t the key issue.
The most useful measure of a processor’s availability is the % processor time. This indicates the percentage of time that the processor (or processors) was busy. It is useful because, taken over a period of time, it indicates the average amount of capacity that is left.
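Read this way, the remaining capacity is simply what is left after averaging % processor time over the period of interest. A minimal sketch, with a hypothetical function name and sample values:

```python
from statistics import mean
from typing import Sequence


def processor_headroom(percent_processor_time: Sequence[float]) -> float:
    """Average percentage of processor capacity left over the sampled period."""
    return 100.0 - mean(percent_processor_time)


# Hypothetical % processor time samples: jagged, but with plenty of headroom.
samples = [50.0, 70.0, 60.0]
print(processor_headroom(samples))
```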
Improving processing speed isn’t an option for most servers. The application will need to be split up, optimized, or a new server installed to replace the existing one. It is for this reason that when processing bottlenecks occur they are some of the most expensive to address.
Until recently not much thought was given to the network as a potential bottleneck but with the advent of super-sized servers with four or more processors and terabytes of disk space it has to be a consideration. Network performance monitoring is a measure of how much of the bandwidth available on the networks is actually being consumed.
This is a tricky proposition since the connected network speed may not be the total effective speed. For instance, suppose a super-server is connected through a 1 Gbps connection to a switch which has eight 100 Mbps connections. The server will assume that 1 Gbps of data can flow through the network that it is connected to. However, in reality only 800 Mbps at the most is truly available to be consumed.
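The arithmetic behind that example can be sketched as follows, treating the link speeds as megabits per second. The function name is hypothetical, and the model is deliberately simple: it assumes all traffic leaves the switch through the listed ports and ignores protocol overhead.

```python
from typing import Sequence


def effective_bandwidth_mbps(server_link_mbps: float,
                             downstream_links_mbps: Sequence[float]) -> float:
    """Bandwidth actually available: the narrower of the server's own link
    and the combined capacity of the links on the far side of the switch."""
    return min(server_link_mbps, sum(downstream_links_mbps))


# The example from the text: a gigabit uplink feeding eight 100-megabit ports.
print(effective_bandwidth_mbps(1000, [100] * 8))

# With twelve such ports the server's own link becomes the limit instead.
print(effective_bandwidth_mbps(1000, [100] * 12))
```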
Another consideration is that many network drivers, even today, are less than stellar at reporting performance information. More than a few network card drivers have failed to properly report what they’re doing.
In general, network performance monitoring should be done from the perspective of understanding whether the network is a possible bottleneck, by estimating the maximum effective throughput of the network and determining what percentage of the theoretical limit that represents. It is reasonable to assume that a 60% utilization rate is all that is really possible for Ethernet.
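Applying the 60% figure gives a practical ceiling against which observed traffic can be judged. The sketch below assumes speeds in megabits per second; the function names and the observed traffic figure are hypothetical.

```python
PRACTICAL_ETHERNET_PERCENT = 60  # rule-of-thumb figure from the text


def practical_limit_mbps(theoretical_mbps: float) -> float:
    """Throughput that can realistically be expected from an Ethernet link."""
    return theoretical_mbps * PRACTICAL_ETHERNET_PERCENT / 100


def percent_of_practical_limit(observed_mbps: float,
                               theoretical_mbps: float) -> float:
    """Observed traffic as a percentage of the practical ceiling."""
    return observed_mbps / practical_limit_mbps(theoretical_mbps) * 100


# A path with a theoretical 800 Mbps carrying 360 Mbps of traffic.
print(practical_limit_mbps(800))
print(percent_of_practical_limit(360, 800))
```

Measured against the theoretical 800 Mbps the link looks less than half used, but against the practical ceiling of 480 Mbps it is already three quarters consumed.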
Resolving the Details
The guidelines here may not be enough to completely diagnose a performance problem and identify a specific course of action to resolve it; in many cases, however, they will be. In those cases where the high-level indicators mentioned here aren’t enough, you’ll have to dive into the other counters and identify which ones will help you isolate the problem and illuminate potential solutions.