Electrical computers and digital processing systems: multicomput – Computer network managing – Computer network monitoring
Reexamination Certificate
2000-05-08
2004-09-14
Lim, Krisna (Department: 2153)
C714S039000, C714S047300
active
06792456
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to network and systems management and, more particularly, to detecting and resolving availability and performance problems.
BACKGROUND OF THE INVENTION
With the dramatic decline in the price of hardware and software, the cost of ownership for computing devices is increasingly dominated by network and systems management. Included here are tasks such as establishing configurations, help desk support, distributing software, and ensuring the availability and performance of vital services. The latter is particularly important since inaccessible and/or slow services decrease revenues and degrade productivity.
The first step in managing availability and performance is event management. Almost all computing devices have a capability whereby the onset of an exceptional condition results in the generation of a message so that potential problems are detected before they lead to widespread service degradation. Such exceptional conditions are referred to as “events.” Examples of situations in which events are generated include: unreachable destinations, excessive CPU consumption, and duplicate IP addresses. An event message contains multiple attributes, especially: (a) the source of the event, (b) type of event, and (c) the time at which the event was generated.
Event messages are sent to an “event management system (EMS).” In existing art, such systems are policy-driven, which means that external descriptions are used to specify the event patterns for which actions are taken. Thus, an EMS has separate subsystems for policy execution and policy authoring. The latter provides a means for the operations staff to construct policies. The former provides for the processing of event messages. In existing art, an EMS has repositories for policies, events, and configuration information used in event management.
Upon arrival of an event message, the policy execution system parses the message to translate it into a normalized form (e.g., by isolating fields instead of having a single text string). This normalized information is then placed into an event repository. Next, the normalized event is fed into a “correlation engine” that processes events as specified by operational policies that address considerations such as:
1. Elimination of duplicate messages. Duplicate is interpreted broadly here. For example, if multiple hosts on the same local area network generate a destination unreachable message for the same destination, then the events contain the same information.
2. Maintenance of operational state. State may be as simple as which devices are up and which are down. It may be more complex as well, especially for devices that have many intermediate states or special kinds of error conditions (e.g., printers).
3. Problem detection. A problem is present if the services cannot be delivered in accordance with a service level agreement (which may be formal or informal). This could be the result of a device failure, exceeding some internal limit (e.g., buffer capacity), or excessive resource demands.
4. Problem isolation. This involves determining the components that are causing the problem. For example, distributing a new release of an application that has software errors can result in problems for all end-users connecting to servers with the updated application.
Items (1) and (2) are, in some sense, intermediate steps to (3) and (4). Thus, we focus on the latter two.
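The normalization and duplicate-elimination steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular EMS: the pipe-delimited message format, the Event fields, the helper names, and the choice to treat events with the same subnet, type, and destination as duplicates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # host that generated the event
    event_type: str   # e.g. "destination unreachable"
    timestamp: float  # time at which the event was generated
    destination: str  # hypothetical extra attribute; may be empty


def parse_event(raw: str) -> Event:
    """Isolate the fields of an assumed 'source|type|time|destination' string."""
    source, event_type, timestamp, destination = raw.split("|")
    return Event(source, event_type, float(timestamp), destination)


def subnet_of(host: str) -> str:
    """Return the first three octets of a dotted IPv4 address."""
    return ".".join(host.split(".")[:3])


def deduplicate(events):
    """Keep only the first event seen for each (subnet, type, destination)."""
    seen = set()
    unique = []
    for ev in events:
        key = (subnet_of(ev.source), ev.event_type, ev.destination)
        if key not in seen:
            seen.add(key)
            unique.append(ev)
    return unique


raw_messages = [
    "82.13.16.4|destination unreachable|100.0|93.16.12.54",
    "82.13.16.9|destination unreachable|101.5|93.16.12.54",
    "82.13.17.2|excessive CPU consumption|102.0|",
]
unique_events = deduplicate(parse_event(m) for m in raw_messages)
```

Here the second message is dropped because a host on the same subnet has already reported the same unreachable destination, which matches the broad interpretation of "duplicate" in item (1).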
The correlation engine provides automation that is essential for delivering cost-effective management of complex computing environments. Existing art provides three kinds of correlation. The first employs operational policies expressed as rules, e.g., K. R. Milliken et al., “YES/MVS and the Automation of Operations for Large Computer Complexes,” IBM Systems Journal, Vol. 25, No. 2, 1986. Rules are if-then statements in which the if-part tests the values of attributes of individual events, and the then-part specifies actions to take. An example of such a rule is “If multiple hosts on the same LAN cannot reach the same destination, then alert the operator that there is a connectivity problem from the LAN to the destination.” The industry experience has been that such rules are difficult to construct, especially if they include installation-specific information.
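The example rule quoted above might be written roughly as follows. This is a hedged sketch only: the tuple layout of an event, the function name connectivity_rule, the min_hosts parameter, and the assumption that a LAN corresponds to the first three octets of an IPv4 address are illustrative choices, not part of any cited system.

```python
from collections import defaultdict


def connectivity_rule(events, min_hosts=2):
    """Apply one if-then rule to (source_host, event_type, destination) tuples.

    If-part: at least `min_hosts` distinct hosts on the same LAN report
    "destination unreachable" for the same destination.
    Then-part: produce an operator alert string for that LAN and destination.
    """
    hosts_by_key = defaultdict(set)
    for source, event_type, destination in events:
        if event_type == "destination unreachable":
            lan = ".".join(source.split(".")[:3])  # assumed LAN identifier
            hosts_by_key[(lan, destination)].add(source)
    return [
        f"Connectivity problem from LAN {lan} to {destination}"
        for (lan, destination), hosts in sorted(hosts_by_key.items())
        if len(hosts) >= min_hosts
    ]


alerts = connectivity_rule([
    ("82.13.16.4", "destination unreachable", "93.16.12.54"),
    ("82.13.16.9", "destination unreachable", "93.16.12.54"),
    ("82.13.17.2", "destination unreachable", "93.16.12.54"),
])
```

Even this small rule embeds installation-specific knowledge (how a LAN is identified, how many hosts count as "multiple"), which illustrates why such rules are hard to construct and maintain.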
Another approach has been developed by SMARTS, see, e.g., SMARTS, “About Code Book,” http://www.smarts.com/codebook.html, 1999. SMARTS is based on the concept of a codebook that matches a repertoire of known problems with event sequences observed during operation. Here, operational policies are models of problems and symptoms. Thus, accommodating new problems requires properly modeling their symptoms and incorporating their signatures into the code book. In theory, this approach can accommodate installation-specific problems. However, doing so in practice is difficult because of the high level of sophistication required. Further, the SMARTS technology only applies to known problems.
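The general idea of matching observed symptoms against a repertoire of known problem signatures can be illustrated with a toy example. This does not reproduce the SMARTS product or its algorithms; the codebook contents, the symptom names "link down" and "slow response", and the use of symmetric-difference size as the distance measure are assumptions made for illustration.

```python
def match_problem(codebook, observed):
    """Return (problem, distance) for the signature closest to `observed`.

    Distance is the number of symptoms present in one set but not the other.
    Returns (None, None) if the codebook is empty.
    """
    best, best_dist = None, None
    for problem, signature in codebook.items():
        dist = len(signature ^ observed)  # symmetric difference of symptom sets
        if best_dist is None or dist < best_dist:
            best, best_dist = problem, dist
    return best, best_dist


# Hypothetical codebook: known problems mapped to their symptom signatures.
codebook = {
    "router failure": frozenset({"destination unreachable", "link down"}),
    "server overload": frozenset({"excessive CPU consumption", "slow response"}),
}

exact = match_problem(codebook, {"destination unreachable", "link down"})
partial = match_problem(codebook, {"excessive CPU consumption"})
```

A problem with no entry in the codebook can only be mapped to the nearest known signature or rejected, which reflects the limitation that this style of correlation applies only to known problems.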
Recently, a third approach to event correlation has been proposed by Computer Associates International, see, e.g., Computer Associates International, “Neugents. The Software that can Think,” Jul. 16, 1999, http://www.cai.com/neugents. This approach trains a neural network to predict future occurrences of events based on the frequency of their occurrence in historical data. Typically, events are specified based on thresholds such as, for example, CPU utilization exceeding 90%. The policy execution system uses the neural network to determine the likelihood of one of the previously specified events occurring at some time in the future. While this technique can provide advance knowledge of the occurrence of an event, it still requires specifying the events themselves. At a minimum, such a specification requires detailing the following:
1. The variable measured (e.g., CPU utilization);
2. The directional change considered (e.g., too large); and
3. The threshold value (e.g., 90%).
The last item can be obtained automatically from examining representative historical data. Further, graphical user interfaces can provide a means to input the information in items (2) and (3). However, it is often very difficult for installations to choose which variables should be measured and the directional change that constitutes an exceptional situation.
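A threshold-based event specification with the three items listed above, and one way the threshold value might be derived from historical data, can be sketched as follows. The class and function names, the sample utilization figures, and the choice of a nearest-rank percentile are assumptions for illustration; the text does not say how the threshold is computed.

```python
import math
from dataclasses import dataclass


@dataclass
class ThresholdSpec:
    variable: str     # item 1: the variable measured
    direction: str    # item 2: "above" (too large) or "below" (too small)
    threshold: float  # item 3: the threshold value

    def is_exceptional(self, value: float) -> bool:
        """Return True if `value` crosses the threshold in the given direction."""
        if self.direction == "above":
            return value > self.threshold
        return value < self.threshold


def percentile_threshold(history, percentile):
    """Nearest-rank percentile of historical measurements (assumed method)."""
    ordered = sorted(history)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical historical CPU utilization samples, in percent.
history = [35, 42, 50, 55, 61, 64, 70, 78, 85, 90]
spec = ThresholdSpec("cpu_utilization", "above", percentile_threshold(history, 90))
```

The threshold is computed mechanically, but the variable name and the direction still have to be chosen by someone who understands the installation, which is the difficulty the paragraph above points out.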
To summarize, existing art uses a micro approach to event correlation. That is, existing correlation engines analyze individual events and their interrelationships. While such an approach has value, it has severe limitations as well. Foremost, existing art requires an expert to develop the operational policies that drive the analysis. As a result, it is difficult for installations to define and maintain customized operational policies.
SUMMARY OF THE INVENTION
The present invention provides systems and methods to simplify and customize the automation of event management. The invention is based on at least the following observation: big problems generate lots of events. This observation suggests a macro approach to event correlation that focuses on the rate at which events are generated rather than their detailed interrelationships.
To illustrate our approach, consider a connectivity problem that occurs between hosts on subnet 82.13.16 and the host 93.16.12.54. Existing art would detect such problems by having rules that examine the event type (“destination unreachable”) and identify that the hosts generating this message are on the same subnet. In contrast, the present invention detects such problems based on the rate at which messages are generated by hosts on the subnet. An event rate threshold is obtained from historical data. If the rate exceeds this threshold, then an alarm is raised. This leads to the rule: “If event rates on a LAN exceed the LAN-specific threshold, raise an alarm.”
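The rate-based rule can be sketched as follows for the events of a single LAN. This is a minimal sketch under stated assumptions: the text says only that the threshold is obtained from historical data, so the mean-plus-k-standard-deviations formula, the 60-second window, the function names, and the sample figures are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def events_per_window(timestamps, window_seconds):
    """Count events in each fixed-size time window, keyed by window index."""
    return Counter(int(t // window_seconds) for t in timestamps)


def rate_threshold(historical_counts, k=3.0):
    """Assumed threshold: historical mean plus k population standard deviations."""
    return mean(historical_counts) + k * pstdev(historical_counts)


def windows_in_alarm(timestamps, window_seconds, threshold):
    """Return the indices of windows whose event count exceeds the threshold."""
    counts = events_per_window(timestamps, window_seconds)
    return sorted(w for w, c in counts.items() if c > threshold)


# Hypothetical history: events per minute previously seen on one LAN.
historical_counts = [2, 3, 1, 2, 4, 3, 2, 3]
threshold = rate_threshold(historical_counts)

# Hypothetical current event timestamps (seconds) from hosts on that LAN.
timestamps = [5, 20, 41, 61, 63, 66, 70, 75, 81, 90, 99]
alarms = windows_in_alarm(timestamps, 60, threshold)
```

Note that the check never inspects the event type or the relationship between hosts; only the count per window matters, which is what distinguishes this macro approach from the rule-based examples.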
Once a problem is detected, event rates provide a way to diagnose the problem. This is achieved by exploiting the structure of the attributes of events. Consider the example in the preceding paragraph. Once an excessive event ra
Hellerstein Joseph L.
Ma Sheng
Lim Krisna
Perez-Pineiro Rafael
Ryan & Mason & Lewis, LLP