Data processing: measuring, calibrating, or testing – Measurement system – Performance or efficiency evaluation
Reexamination Certificate
1999-09-30
2004-01-06
Hoff, Marc S. (Department: 2857)
C700S031000
active
06675128
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to automating operations in accordance with performance management systems and, more specifically, to facilitating the specification of exceptional conditions in such systems.
BACKGROUND OF THE INVENTION
Approximately 80% of the cost of networked systems can be attributed to management and operations. These costs are related to activities such as distributing and installing software, providing help desk support, and detecting performance and availability problems.
Generally, existing practice for detecting performance and availability problems consists of the following steps: (i) determining a set of metrics to monitor that indicate the presence of problems (e.g., CPU utilization, error rates, transaction request rates); (ii) establishing thresholds on the values of these metrics based on past experience; (iii) using management system software to detect threshold violations; and (iv) responding to threshold violations by taking actions (e.g., adjusting user priorities, restricting the admission of traffic into the network).
So fundamental are these steps to existing practice that the information that drives them is typically externalized as management “policies.” A policy consists of a metric (or function of multiple metrics), a relational operator that specifies the direction of change in the metric that is undesirable, a threshold value, and an action to take when the threshold is violated.
Typically, policies are expressed as if-then rules. The if-part (or left-hand side, LHS, of the policy) contains a predicate expressed as a bound on one or more metrics. The then-part (or right-hand side, RHS, of the policy) contains the action to take. An example is: “If CPU utilization is greater than 90%, then alarm.” Here, “CPU utilization” is the metric, the relational operator is “greater than,” the threshold value is “90%,” and the action is “alarm” (e.g., send an urgent message to the operations console). The threshold value for alarms may be chosen so that it lies well beyond what is considered normal. We use the term “alarm threshold” for the metric value that, if exceeded, results in either the generation of an alarm or a management action (e.g., terminate a process). Existing approaches check for threshold violations and, when these violations occur, initiate the action specified in the right-hand side of the policy.
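The policy structure described above (metric, relational operator, threshold, action) can be sketched as a small evaluator. This is an illustrative sketch only; the function and field names below are assumptions, not taken from the patent:

```python
import operator

# Map the policy's relational operator (as text) to a comparison function.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def make_policy(metric, op, threshold, action):
    """Build a policy: the LHS tests `metric op threshold` against a sample;
    the RHS invokes `action` when the LHS is satisfied."""
    def evaluate(sample):
        if OPS[op](sample[metric], threshold):
            return action(sample)
        return None  # LHS not satisfied: no action
    return evaluate

# "If CPU utilization is greater than 90%, then alarm."
alarm_policy = make_policy("cpu_util", ">", 90.0,
                           lambda s: f"ALARM: cpu_util={s['cpu_util']}")

print(alarm_policy({"cpu_util": 95.0}))  # ALARM: cpu_util=95.0
print(alarm_policy({"cpu_util": 50.0}))  # None
```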
In practice, policies have another aspect as well. In order to eliminate transients, it is often the case that the right-hand side of a policy is executed only after the left-hand side of the policy has been satisfied for several successive time intervals. A common version of the foregoing example is therefore: “If CPU utilization is greater than 90% for three successive time intervals, then alarm.” In effect, embedded within the left-hand sides of policies in the existing art are higher level policies that determine when the right-hand side should be executed. An example of such a higher level policy is “for three successive time intervals,” as in the foregoing example.
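The higher level “for k successive time intervals” condition can be sketched as a wrapper around a raw left-hand-side test. The `hold_for` helper and the sample values below are illustrative assumptions, not the patent's implementation:

```python
def hold_for(k):
    """Wrap a raw LHS test so that it reports a violation only after the
    test has been satisfied for k consecutive intervals (suppressing
    transients, as described in the text)."""
    def wrap(lhs_test):
        count = {"n": 0}  # consecutive violating intervals seen so far
        def test(value):
            count["n"] = count["n"] + 1 if lhs_test(value) else 0
            return count["n"] >= k
        return test
    return wrap

# "If CPU utilization > 90% for three successive time intervals, then alarm."
lhs = hold_for(3)(lambda u: u > 90.0)
samples = [95, 96, 50, 92, 93, 94]   # the 50 resets the consecutive count
fired = [lhs(u) for u in samples]
print(fired)  # [False, False, False, False, False, True]
```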
Existing art provides for policy authoring and execution. That is, administrators typically have a graphical user interface through which they specify policy metrics, threshold values, relational operators, and actions. The management system acquires the data necessary to test the left-hand side of a policy and to execute the right-hand side of a policy.
In order to author policies, administrators must specify one or more values for alarm thresholds (e.g., 90% CPU utilization). Doing so can be quite burdensome since the appropriate choice for an alarm threshold depends on factors such as configuration and workload. To complicate matters, workloads are time varying and so the appropriate choice of threshold values is time varying as well.
Researchers have tried to address these difficulties by: (i) computing threshold values from historical data (e.g., J. Buzen and A. Shum, “MASF-Multivariate Adaptive Statistical Filtering,” Proceedings of the Computer Measurement Group, pp. 1-10, 1995; and L. Ho et al., “Adaptive Network/Service Fault Detection in Transaction-Oriented Wide Area Networks,” Integrated Network Management VI, edited by M. Sloman et al., IEEE Publishing, 1999); (ii) developing multivariate tests for network-based problems (e.g., M. Thottan and C. Ji, “Fault Prediction at the Network Layer Using Intelligent Agents,” Integrated Network Management VI, edited by M. Sloman et al., IEEE Publishing, 1999); and (iii) automating the updates of threshold values (e.g., the Ho et al. article).
Even so, existing art is deficient in two respects: (1) there is no mechanism for automated adaptation of alarm thresholds tested by agents on managed elements; and (2) higher level policies are embedded within the left-hand sides of the policies in existing art and hence changing these policies often requires extensive modifications to the management automation. Note that item (1) requires more than the distribution of new threshold values (e.g., as in Ho et al.). It also requires a means to determine when threshold values should be changed.
In addition to the foregoing, existing art is deficient in the manner in which “warning policies” are handled. Warning policies provide advance notice of alarm situations so that management staff can detect problems before they lead to widespread disruptions. In the existing art, warning policies are constructed manually by administrators. That is, administrators must specify a set of warning thresholds in addition to the alarm thresholds. Violating a warning threshold causes a message to be sent to the operations staff. Below is an example of a warning threshold for the previously introduced policy for CPU utilization: “If CPU utilization is greater than 80% for three successive time intervals, then warn.”
In existing practice, warning thresholds are specified in the same manner as alarm thresholds. Thus, there is no insight as to when or if the alarm threshold will be violated once a warning threshold is violated. Further, administrators are burdened with specifying still more thresholds.
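As a minimal sketch of the existing practice described above, a warning threshold can be tested alongside the alarm threshold in exactly the same manner. The `classify` function is hypothetical; the 80% and 90% values follow the examples in the text:

```python
def classify(cpu_util, warn=80.0, alarm=90.0):
    """Classify one CPU-utilization sample against a manually specified
    warning threshold and alarm threshold, checked the same way."""
    if cpu_util > alarm:
        return "alarm"
    if cpu_util > warn:
        return "warn"
    return "ok"

print([classify(u) for u in (70, 85, 95)])  # ['ok', 'warn', 'alarm']
```

Note that, as the text observes, nothing in this scheme indicates whether or when the alarm threshold will actually be violated once the warning threshold is crossed.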
SUMMARY OF THE INVENTION
The present invention provides methods and apparatus that reduce the burden on administrators for performance management. The methods and apparatus use models of metric values to construct and enforce: (1) alarm policies that adjust automatically to changes, for example, in configuration, topology, and workload; and (2) warning policies based on the probability of violating an alarm policy within a time horizon.
It is to be appreciated that a performance management system of the present invention preferably utilizes forecasting models (e.g., analysis of variance and time series models) to capture non-stationarities (e.g., time-of-day variations) and time-serial dependencies. For example, as described in J. Hellerstein, F. Zhang, and P. Shahabuddin, “An Approach to Predictive Detection for Service Level Management,” Integrated Network Management VI, edited by M. Sloman et al., IEEE Publishing, May 1999, the disclosure of which is incorporated herein by reference, a model in which S(i,j,k,l) is the value of a metric at time of day (i), day of week (j), month (k), and instance (l) may be employed in accordance with the invention. The model is: S(i,j,k,l)=mean+mean_tod(i)+mean_day-of-week(j)+mean_month(k)+e(i,j,k,l). Here, the terms beginning with “mean” are constants that are estimated from the data, and e(i,j,k,l) are the residuals of S. These constants may be estimated using standard statistical techniques such as analysis of variance and least squares regression, which are well known in the art. The residuals are identically distributed (stationary), but time serial dependencies may remain. To remove time serial dependencies, a second model may be used: e(t)=a_1*e(t−1)+a_2*e(t−2)+y(t), where a_1 and a_2 are constants estimated from the data and the y(t) are independent and identically distributed normal random variables. The y(t) are the result of removing time serial dependencies from e(t).
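The second-stage model above can be illustrated with synthetic data: simulate residuals from e(t)=a_1*e(t−1)+a_2*e(t−2)+y(t) with known coefficients, then re-estimate a_1 and a_2 by least squares via the 2x2 normal equations. This is a sketch under assumed parameter values (a_1=0.5, a_2=−0.3), not the patent's implementation:

```python
import random

random.seed(1)
a1_true, a2_true = 0.5, -0.3   # assumed "true" AR(2) coefficients
n = 5000

# Simulate e(t) = a1*e(t-1) + a2*e(t-2) + y(t), y(t) ~ i.i.d. N(0, 1)
e = [0.0, 0.0]
for t in range(2, n):
    e.append(a1_true * e[t-1] + a2_true * e[t-2] + random.gauss(0, 1))

# Least squares estimation: solve the 2x2 normal equations for (a1, a2)
s11 = sum(e[t-1] * e[t-1] for t in range(2, n))
s12 = sum(e[t-1] * e[t-2] for t in range(2, n))
s22 = sum(e[t-2] * e[t-2] for t in range(2, n))
b1 = sum(e[t] * e[t-1] for t in range(2, n))
b2 = sum(e[t] * e[t-2] for t in range(2, n))
det = s11 * s22 - s12 * s12
a1_hat = (b1 * s22 - b2 * s12) / det   # Cramer's rule
a2_hat = (s11 * b2 - s12 * b1) / det

print(round(a1_hat, 2), round(a2_hat, 2))  # estimates close to 0.5 and -0.3
```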
International Business Machines Corporation
Miller Craig Steven
Ryan & Mason & Lewis, LLP
Zarick Gail H.