How the eG Enterprise to TT System Integration Works
Alarms in eG Enterprise
To understand how the integration of the eG manager with a trouble ticketing (TT) system works, let's first consider what an alarm is. An alarm in eG Enterprise is identified by an Alarm ID. At any given instant, an Alarm ID corresponds to a unique combination of the following attributes:
- The problem component-type
- The problem component (i.e., network device, application, etc.)
- The problem layer
- The problem priority (Critical, Major, Minor)
The eG monitoring interface lists the alarms that currently exist in the eG Enterprise system. The goal of the eG Enterprise integration with TT systems is to forward updated information on current alarms to the TT system.
Every time a state change (e.g., a change of priority or the correction of a problem) is detected in the monitored environment, the eG manager compares the component, component-type, layer, and priority combination of all open problems with their previous values to determine whether a new alarm has been generated, an existing alarm has been modified, or an existing alarm has been closed. If a new alarm has been generated, the eG manager assigns it a distinct alarm ID. If an existing alarm has been modified or closed, the eG manager retains the alarm ID assigned earlier. Modification of an alarm can include any of the following cases:
- A change in the alarm priority: This could be a switch to a higher or lower priority.
- A change in the alarm description: For example, originally, a usage-related alarm may have been raised on disk ‘D’ of a server. Later, disk ‘C’ of the same server might have experienced a space crunch, causing another alarm to be raised. In this case, the description of the original alarm will change to indicate that both disks C and D are experiencing a problem, but the alarm ID will not change. Changes in alarm description may also happen if additional tests being run for the same layer indicate a problem. A change may involve either an addition to the description (as in the example above) or a removal of one or more descriptors (e.g., the space usage of disk ‘C’ in the example above returning to a normal condition).
- A change in the list of impacted services
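The comparison described above can be illustrated with a minimal Python sketch. It is not the eG manager's actual implementation: the names ProblemKey, ProblemState, and classify_changes are hypothetical, and the sketch assumes that an open problem is tracked by its component type, component, and layer, with priority, description, and impacted services treated as attributes that can change while the alarm ID stays the same.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical tracking key (an assumption of this sketch):
# (component type, component, layer).
ProblemKey = Tuple[str, str, str]


@dataclass(frozen=True)
class ProblemState:
    """Attributes that may change without the alarm ID changing."""
    priority: str                          # "Critical", "Major", or "Minor"
    description: str
    impacted_services: frozenset = frozenset()


def classify_changes(
    previous: Dict[ProblemKey, ProblemState],
    current: Dict[ProblemKey, ProblemState],
) -> Dict[str, List[ProblemKey]]:
    """Compare two snapshots of open problems and label each change."""
    # Present now but not before: a new alarm (gets a distinct alarm ID).
    new = sorted(k for k in current if k not in previous)
    # Present before but not now: the alarm has been closed.
    closed = sorted(k for k in previous if k not in current)
    # Present in both with a different priority, description, or
    # list of impacted services: a modified alarm (alarm ID retained).
    modified = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return {"new": new, "modified": modified, "closed": closed}
```

For example, if the description of an open problem changes from covering disk D to covering disks C and D, the problem appears under "modified" and would keep its alarm ID.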
Each alarm is associated with a start date and time, which signifies when the alarm was first generated by the eG manager. Any subsequent change in the state of the alarm does not change its start date and time. Hence, even if an alarm changes in priority at a later time, its start date and time remain the same until the alarm is finally closed. When an alarm is closed, a normal alert is generated, which bears the current date and time.
To avoid conflicting or duplicate alarm IDs across the managers in a redundant eG manager cluster, the alarm ID is expressed as a string of the form <eG_Manager>_<numeric_value>, where <numeric_value> is a timestamp of when the alarm was first generated.
Prior to generating an alarm, the eG managers in a cluster synchronize with each other to ensure that duplicate alarms are not generated and that different alarm IDs are not generated for the same problem. As in the case of email alerts and SNMP traps, each manager in the cluster is responsible for generating alarms for the agents that report directly to it.
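A TT-side script may need to split such an alarm ID back into its two parts. The following Python sketch shows one way to do so. The function name split_alarm_id is hypothetical, and the sketch assumes that the numeric part consists only of digits and that the manager name may itself contain underscores, so the split is made at the last underscore. The unit of the timestamp is not stated here, so the sketch does not convert it to a date.

```python
from typing import Tuple


def split_alarm_id(alarm_id: str) -> Tuple[str, int]:
    """Split '<eG_Manager>_<numeric_value>' into (manager, numeric value)."""
    # Split at the LAST underscore, since the manager name
    # may contain underscores of its own (an assumption).
    manager, separator, numeric = alarm_id.rpartition("_")
    if not separator or not manager:
        raise ValueError(f"not of the form <eG_Manager>_<numeric_value>: {alarm_id!r}")
    if not (numeric.isascii() and numeric.isdigit()):
        raise ValueError(f"numeric part is not a plain number: {alarm_id!r}")
    return manager, int(numeric)
```

Because the manager name is part of the ID, two managers in a cluster that generate an alarm at the same instant would still produce different IDs.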
Integration with Trouble Ticketing Systems
The eG manager can be configured so that whenever an alarm undergoes a change - either generation, modification, or closure - the manager communicates this information to a TT system.
This communication can be in any of the forms mentioned below:
- Many trouble ticketing systems provide a mail interface that receives email alerts of problems in the environment. The eG Enterprise system can be configured to use this interface to send alarms generated by the eG manager as email alerts to the trouble ticketing system. Based on the mails so received, the trouble ticketing system may generate trouble tickets and forward them to the concerned maintenance personnel. For a detailed discussion, refer to the Integrating the eG Manager with a TT System via a TT Mail Interface chapter.
- The eG manager can also send its alarms as SNMP traps to third-party SNMP management systems. When doing so, you can specifically configure the eG manager to send these traps as trouble tickets to the third-party system. This ensures that every SNMP trap sent by the eG manager is tagged with a unique TT ID, which helps track the status of the problem for which the trap was originally raised. To know how this is done, refer to the Trouble Ticket Integration Using SNMP Traps chapter.
- The eG manager supports a command line interface that can be configured to automatically execute TT system-specific commands as and when alarms are added, modified, or deleted in eG Enterprise. This interface offers a means of communication between the eG manager and a TT system. The Trouble Ticket Integration Using the eG TT CLI chapter discusses this in detail.
- The eG manager can also forward its alarm information to any web services interface that the trouble ticketing system may support, to trigger the automatic creation or closure (as the case may be) of trouble tickets. See the Trouble Ticket Integration Using a Web Services Framework chapter.
Handling eG Alarms in a Trouble Ticketing System
A trouble ticket system must be configured to process alarms reported to it by an eG manager. The alarm ID must be used to uniquely identify an alarm. The functions that the TT system must perform are:
- Determine if an alarm ID indicates a new alarm. If yes, open a new trouble ticket.
- If an alarm ID indicates an existing alarm, check the priority of the alarm. If the priority is Normal, this implies that the alarm has been closed in eG Enterprise. Hence, close the corresponding trouble ticket in the TT system.
- If an alarm ID indicates an existing alarm and the priority of the alarm is not Normal, update the corresponding trouble ticket with the current priority of the alarm and with its current description.
Implementing these functions often involves scripting or configuration on the TT system.
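The three functions listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the interface of any particular TT system: Ticket and handle_alarm are hypothetical names, an in-memory dictionary of open tickets stands in for the TT system's own store, and the handling of a Normal alert for an unknown alarm ID (it is ignored) is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Ticket:
    alarm_id: str
    priority: str
    description: str


def handle_alarm(open_tickets: Dict[str, Ticket], alarm_id: str,
                 priority: str, description: str) -> str:
    """Apply one alarm notification and return the action taken."""
    ticket = open_tickets.get(alarm_id)

    if ticket is None:
        if priority == "Normal":
            # Assumption: a closure for an alarm with no open ticket is ignored.
            return "ignored"
        # The alarm ID is not known: open a new trouble ticket.
        open_tickets[alarm_id] = Ticket(alarm_id, priority, description)
        return "opened"

    if priority == "Normal":
        # Existing alarm reported as Normal: the alarm was closed in
        # eG Enterprise, so close the corresponding ticket.
        del open_tickets[alarm_id]
        return "closed"

    # Existing alarm that is still a problem: record the current
    # priority and description on the ticket.
    ticket.priority = priority
    ticket.description = description
    return "updated"
```

Keeping only open tickets in the dictionary means that an alarm ID seen again after its ticket was closed is treated as a new alarm and opens a fresh ticket.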
Once the above steps are accomplished, administrators can review the status of the trouble tickets to be immediately aware of the current status of the infrastructure being monitored, without having to log in to the eG Enterprise console.
Also note the following:
- If a standalone (i.e., non-redundant) eG manager is restarted, all outstanding alarms, and hence all open trouble tickets, will be closed. After the restart, if an old problem recurs, the restarted manager will assign a new alarm ID to this problem; as a result, new trouble tickets will be opened for such problems.
- In a redundant configuration, when a manager is restarted, it checks if the other manager is available. If the other manager in the cluster is not available, all outstanding alarms will be closed. On the other hand, if the other manager in the cluster is available, then the manager being restarted will synchronize alarm information with the other manager. When it detects a problem, the restarted manager checks to see if the other manager in the cluster has already assigned an alarm ID to this problem. If so, then the restarted manager assigns the same ID to the problem. In such a case, new trouble tickets will not be opened for the existing problems.
- In rare instances, when there are rapid alarm transitions (e.g., from critical to normal to critical state) for the same component type-component name-layer-priority combination in a redundant eG manager configuration, the same alarm ID may be reused to refer to the new alarm.
- The same eG manager can be configured for different modes of integration with a TT system, be it email integration, command line integration, or web services-based integration.
- Since the eG manager forwards the current status of an alarm to the TT system, and since such transmission is done only at periodic intervals, the eG Enterprise-TT system integration does not capture all state transitions in the infrastructure being monitored. For instance, if the MailCheckPeriod setting is 3 minutes, an event that occurs and is corrected within 1 minute may never be captured in the TT system. If you require higher sensitivity in trouble ticket tracking, consider lowering the MailCheckPeriod setting (to as little as 1 second). The lower the MailCheckPeriod value, the greater the overhead on the eG manager.
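The effect of the check period on what reaches the TT system can be illustrated with a small Python sketch. It is a simplified model, not a description of how the eG manager schedules its checks: the function name is_captured is hypothetical, times are whole seconds, and the sketch assumes that the alarm status is sampled at exact multiples of the period, starting at time 0.

```python
def is_captured(start_s: int, end_s: int, period_s: int) -> bool:
    """Return True if a problem lasting from start_s (inclusive) to
    end_s (exclusive) is still open at some periodic check instant."""
    if period_s <= 0 or end_s < start_s:
        raise ValueError("period must be positive and end must not precede start")
    # First check instant at or after the start of the problem
    # (integer ceiling division, then scaled back to seconds).
    first_check = -(-start_s // period_s) * period_s
    # The problem is seen only if it is still open at that instant.
    return first_check < end_s
```

With a period of 180 seconds, a problem that lasts from second 10 to second 70 falls entirely between two checks and is missed, whereas with a period of 1 second the same problem is seen. A short problem is still seen with the longer period if it happens to straddle a check instant, for example one lasting from second 170 to second 190.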