I was part of an interesting discussion recently about how one measures the scalability of an IT management system. In the olden days, the focus was on fault management and the big challenge that management systems faced was event storms. In those days, SNMP agents were the primary means of monitoring devices and these agents did not have a lot of intelligence. So when a problem happened, the management system would receive events from all the devices in the network and would have to correlate these events to determine where the problem was. The scalability of the management system was measured by the number of network devices and by the number of events that the management system could process.
Today’s scenario is very different:
- First of all, management agents have more intelligence built in, and since the focus is on outage avoidance (not problem detection), event storms are not common. Well before a user notices a problem, the management system has already alerted the administrator.
- Today’s management systems also have to handle heterogeneous network, storage, server, application, and virtualized infrastructure elements. A single server may be hosting multiple applications (e.g., web server, middleware, database all on the same server) that need to be managed. An application may have multiple instances – e.g., in some of our largest installations, there are tens of Oracle database instances and WebLogic application server instances running on the same server. So scalability cannot be measured just based on the number of physical systems monitored but must also take into account the number of application instances being managed.
- Yet another change in today’s scenario is the high percentage of servers that are virtualized: a single server can host anywhere from tens to over a hundred virtual machines. Any definition of management system scalability must therefore also include the number of virtual infrastructure elements being monitored.
- Then, the depth of monitoring also comes into play. A tool that offers only minimal coverage, collecting 5-10 metrics per application or virtual machine, will obviously be able to support more infrastructure elements than one that offers in-depth analysis and metrics for each infrastructure element. The total number of metrics collected by the management system (irrespective of whether these metrics pertain to the network, server, application or virtualization platform) is a better indicator of management system scalability than the number of servers and applications monitored.
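To make the last point concrete, here is a rough sketch of how the total metric count can diverge from the element count. The `ElementGroup` type, the element kinds, and every number below are hypothetical, chosen only for illustration; they are not measurements from any real installation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementGroup:
    """A class of monitored elements (hypothetical model for illustration)."""
    kind: str
    count: int
    metrics_per_element: int


def total_elements(groups):
    """Number of monitored infrastructure elements across all groups."""
    return sum(g.count for g in groups)


def total_metrics(groups):
    """Number of metrics collected per polling cycle across all groups."""
    return sum(g.count * g.metrics_per_element for g in groups)


# Shallow monitoring: many elements, only a handful of metrics each.
shallow = [
    ElementGroup("server", 1000, 10),
    ElementGroup("vm", 5000, 5),
]

# Deep monitoring: fewer elements, many metrics each.
deep = [
    ElementGroup("server", 200, 60),
    ElementGroup("vm", 1000, 40),
    ElementGroup("app_instance", 500, 80),
]

print(total_elements(shallow), total_metrics(shallow))  # 6000 35000
print(total_elements(deep), total_metrics(deep))        # 1700 92000
```

In this made-up comparison, the deep deployment watches fewer than a third as many elements yet handles more than twice as many metrics per cycle, which is why counting servers or applications alone can be misleading.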
Ultimately though, any discussion on scalability has to be tied to the effectiveness of a management system. For example, some management systems are intelligent enough to collect a smaller subset of metrics periodically and to increase the metrics collected when something unusual is detected. Should such a solution be regarded as being less scalable than one that collects more metrics all the time? Definitely not.
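One way to picture this kind of adaptive collection is the minimal sketch below. It assumes a simple threshold check as the test for "something unusual"; the metric names, thresholds, and the `metrics_to_collect` function are all hypothetical and do not describe how any particular product works.

```python
# Hypothetical metric sets: a small baseline and a larger diagnostic set.
BASELINE_METRICS = ("cpu_util", "mem_util")
DETAILED_METRICS = BASELINE_METRICS + (
    "disk_queue",
    "net_retransmits",
    "top_processes",
)


def metrics_to_collect(last_sample, thresholds):
    """Pick the metric names to gather in the next polling cycle.

    Returns the small baseline set while every watched value stays at or
    below its threshold, and the detailed set once any value exceeds it.
    """
    unusual = any(
        last_sample.get(name, 0.0) > limit
        for name, limit in thresholds.items()
    )
    return DETAILED_METRICS if unusual else BASELINE_METRICS


# Assumed thresholds, in percent.
thresholds = {"cpu_util": 85.0, "mem_util": 90.0}

normal = metrics_to_collect({"cpu_util": 40.0, "mem_util": 55.0}, thresholds)
spike = metrics_to_collect({"cpu_util": 97.0, "mem_util": 55.0}, thresholds)

print(len(normal), len(spike))  # 2 5
```

A system built along these lines would report a low metric count most of the time, yet it could still deliver the detail an administrator needs at the moment a problem is developing.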
Hence, if you are evaluating management solutions, don’t go by any absolute measure of scalability (number of servers, applications, metrics). Consider scalability of the management system in conjunction with the effectiveness of this system and its total cost of ownership!