System and method for automatic identification of...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C709S223000

active

06457143

ABSTRACT:

BACKGROUND
1. Technical Field
The present application relates generally to a system and method for automatically diagnosing information systems that suffer from degradations in performance and service availability and, more particularly, to a system and method for automatically identifying bottleneck resources in a computer network using inference methods for analyzing end-to-end performance data without the need for detailed information about individual resources and processes.
2. Description of Related Art
A critical task in managing distributed applications is the process of diagnosing the cause of degradation in quality of service to end-users. For mission critical applications, the ability to resolve problems in an expedient manner is particularly important. Due to the complexity of distributed applications, problem diagnosis requires skills across multiple disciplines. Unfortunately, problems often get routed to the wrong department or the departments themselves do not agree on who should accept the responsibility. Therefore, it would greatly enhance productivity and reduce the time and cost of problem resolution if the scope of the problem could be automatically isolated to a small subset of bottleneck resources.
The process of identifying bottlenecks, however, is a difficult task when a large number of resources are involved. This is the case even for a simple business transaction such as purchasing items over the Internet. The application supporting the business transaction typically requires services from multiple servers. The servers may include name servers, proxy servers, web servers, mail servers, database servers, etc. In addition, the underlying application may require the service of connectivity resources to transfer data between the user machines and the servers as well as among the server machines. These connectivity resources typically include routers, switches, gateways and network bandwidth. Moreover, at the software level, the application may require services from various functional components such as file systems, directories, communication facilities, transaction middleware and databases.
Conventional approaches to problem determination include monitoring detailed metrics from individual resources. For instance, counters and meters are instrumented into various hardware and software entities to measure utilization, contention, data rates, error rates, etc. These metrics reveal the internal workings of each component. If any metric exceeds its predefined threshold value, an alarm is generated.
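The threshold check described above can be illustrated with a minimal sketch. The metric names and threshold values are hypothetical and chosen only for illustration; the source does not specify any particular metrics or limits.

```python
from typing import Dict, List

# Hypothetical metric names and threshold values (assumptions, not from the source).
THRESHOLDS: Dict[str, float] = {
    "cpu_utilization": 0.90,
    "error_rate": 0.05,
}


def check_thresholds(
    metrics: Dict[str, float],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> List[str]:
    """Return the names of metrics whose value exceeds its predefined threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    )


# One sample of metrics from a single resource.
sample = {"cpu_utilization": 0.95, "error_rate": 0.01}
alarms = check_thresholds(sample)  # an alarm is raised for each listed metric
```

Each metric is judged in isolation here, which is why this style of monitoring cannot account for one metric compensating for another.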
There are various disadvantages to using the conventional diagnostic approach. One disadvantage is that the method requires the constant monitoring of multiple metrics of potential bottleneck resources, thereby generating a large data volume and traffic and imposing an excessive workload on the information analysis system. Another disadvantage is that resource metrics may carry a large amount of redundant information, as well as apparently conflicting information. In addition, the metrics cannot reveal all possible problems.
Another disadvantage is that an excessive value of a particular resource metric at any point in time does not necessarily imply a bottleneck condition because the adverse effect of one metric is often compensated by the favorable conditions of other metrics. Indeed, in systems having built-in redundancy (e.g., alternate paths), the deficiency of one resource instance can also be absorbed by other resource instances, thereby reducing the impact on overall performance due to the temporary local anomaly. Consequently, the extensive monitoring of individual resources tends to generate large numbers of false alarms. Therefore, the aforementioned disadvantages associated with the quantitative resource metric approach may lead to scalability and accuracy problems in bottleneck identification using resource metrics in medium to large enterprises.
Another conventional technique for diagnosing problems is referred to as the “event-based” method, which involves correlating events or alarms from resources. In particular, this method involves detecting “patterns” in an event stream, where a “pattern” is generally defined as the occurrence of related events in close proximity of time. With the event-based approach, events that are part of any recognizable pattern are considered to be part of an event group. Each pattern has a leading event and the resource that originates the leading event is considered the root cause of other events in the group and the root cause of the problem associated with the pattern.
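A rough sketch of this event-based grouping follows. It assumes that any two consecutive events separated by no more than a fixed time window belong to the same pattern, which is a simplification: the source does not define how relatedness is decided. The event tuples, resource names, and the `window` parameter are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, originating resource)


def group_events(events: List[Event], window: float) -> List[List[Event]]:
    """Group events whose gap to the previous event is at most `window` seconds."""
    groups: List[List[Event]] = []
    for event in sorted(events):
        if groups and event[0] - groups[-1][-1][0] <= window:
            groups[-1].append(event)  # close in time: same pattern
        else:
            groups.append([event])  # gap too large: start a new pattern
    return groups


def root_causes(events: List[Event], window: float) -> List[str]:
    """Treat the resource of each group's leading (earliest) event as its root cause."""
    return [group[0][1] for group in group_events(events, window)]


events = [
    (10.0, "router-1"),
    (10.4, "web-server"),
    (11.0, "db-server"),
    (60.0, "mail-server"),
]
causes = root_causes(events, window=1.0)
```

In this sketch an isolated event forms a group of its own; a real correlator would more likely require a recognizable pattern before naming a root cause.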
The effectiveness of the event-based approach is limited to problems arising from serious failures and malfunctions for which explicit alarm mechanisms have been instrumented. Another disadvantage of the event-based approach is that it requires the analysis of large amounts of event or alarm data from each resource. As such, it suffers from the same scalability and accuracy problems as the resource metric based approach.
Another conventional method for identifying bottlenecks places emphasis on collecting quality of service data such as end-to-end response times and end-to-end availability measures. This data is effective for detecting problems from the end-user's perspective and provides a valid basis for suspecting that a bottleneck condition exists. The end-to-end data by itself, however, does not exactly identify the bottleneck resource. Indeed, this approach cannot be used for diagnosis in the absence of intelligent interpretations by human experts.
To overcome this problem, a more direct conventional approach involves producing a detailed breakdown of the end-to-end data into components. The component with the largest response time is deemed a bottleneck that causes problems in end-to-end response time. Unfortunately, such component level data is not always readily available from most network and server products deployed in a network configuration. Moreover, a detailed response time decomposition process requires instrumentation at each network or server resource. It often requires modifications to the application, the middleware and software modules running in the network devices.
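The decomposition rule can be sketched as follows, assuming a per-component breakdown of one end-to-end response time is already available. The component names and timings are hypothetical; as noted above, obtaining such a breakdown is the difficult part in practice.

```python
from typing import Dict


def largest_component(breakdown: Dict[str, float]) -> str:
    """Return the component with the largest share of the end-to-end response time."""
    return max(breakdown, key=lambda name: breakdown[name])


# Hypothetical breakdown of one transaction's response time, in seconds.
breakdown = {"network": 0.12, "web_server": 0.30, "database": 1.45}
bottleneck = largest_component(breakdown)
end_to_end = sum(breakdown.values())  # the components add up to the end-to-end time
```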
For certain network protocols, a trace analysis approach may be used wherein response time components can be deduced from traces of low-level events by recognizing the time instants when a request or reply is sent or received by a host. Again, the analysis of protocol traces involves a great deal of reverse engineering and guesswork to correlate events because the beginning and the end of each response time component are not always clearly demarcated in the trace. In addition, trace analysis poses a great challenge when the data over the network is encrypted for security reasons since the data necessary for correlation is not visible. On top of all these issues, the decomposition approach runs into scalability problems because large amounts of data have to be collected and correlated at the per resource level. As a result, the trace-based decomposition approach is used mostly for application debugging during the development stage and is not recommended for regular quality of service management after the deployment of the application.
Accordingly, a simplified system and method that provides automatic identification of bottleneck resources in a computer network is highly desirable. A simplified bottleneck identification process should use only end-to-end quality of service data and eliminate the need for monitoring detailed internal resource metrics, monitoring and correlating events from resources, and measuring or estimating component response times, such as required by conventional techniques.
SUMMARY OF THE INVENTION
The present invention is directed to a system and method for providing automated bottleneck identification in networks and networked application environments by processing end-to-end quality of service measurements in combination with knowledge of internal resource dependency information generated by a network administrator. Ad
