Fault monitor for restarting failed instances of the fault...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C714S047300

active

06718486

ABSTRACT:

COPYRIGHT NOTICE
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
BACKGROUND OF THE INVENTION
The present invention relates to the field of electronic commerce (e-commerce) and particularly to electronic systems in capital markets and other e-commerce applications with high availability and scalability requirements.
Historically, mission-critical applications have been written for and deployed on large mainframes, typically with built-in (hardware) or low-level operating system (software) fault-tolerance. In some prior art, such fault-tolerance mechanisms include schemes where multiple central processing units (CPUs) redundantly compute each operation and the results are compared using a vote (in the case of three-way or more redundancy) or other logical comparisons of the redundant outcomes in order to detect and avoid failures. In some cases a fault-stop behavior is implemented, where it is preferred to stop and not execute a program operation when an error or other undesired condition would result. This fault-stop operation helps to minimize the propagation of errors to other parts of the system. In other implementations, elaborate fault recovery mechanisms are implemented. These mechanisms typically recover only from hardware failures, since application failures tend to be specific to the particular application software. To detect errors in application software, vast amounts of error-handling code have been required. Certain financial applications have devoted as much as 90% of their code to error detection and correction. Because of the enormous complexity of such software applications, it is nearly impossible to eliminate entirely the failures that prevent reliable and continuous operation.
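The redundant-computation scheme described above can be illustrated with a minimal sketch. It is not taken from any cited prior-art system: the names `FaultStop`, `vote` and `redundant_compute` are hypothetical, each "CPU" is modeled as a plain Python callable, and a strict majority is assumed as the voting rule.

```python
from collections import Counter


class FaultStop(Exception):
    """Raised when redundant results disagree and no strict majority exists."""


def vote(results):
    """Return the value produced by a strict majority of the redundant units.

    Results must be hashable. If no value has a strict majority, stop
    instead of propagating a possibly wrong answer (fault-stop).
    """
    if not results:
        raise FaultStop("no results to compare")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return value
    raise FaultStop(f"no majority among {results!r}")


def redundant_compute(units, *args):
    """Run the same operation on every unit and vote on the outcomes."""
    return vote([unit(*args) for unit in units])


# Hypothetical units: two compute the square correctly, one is faulty.
def healthy(x):
    return x * x


def faulty(x):
    return x * x + 1


majority_result = redundant_compute([healthy, healthy, faulty], 7)
```

With three units the faulty result is outvoted. With only two units that disagree there is no majority, so the sketch raises `FaultStop` and does not return a value.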
Increasingly, systems need to be available on a continuous basis, 24 hours per day, 7 days per week (24/7 operation). In such nonstop environments it is undesirable for a system to be unavailable when system components are being replaced or software and hardware failures are detected. In addition, today's applications must scale to increasing user demands that in many cases exceed the processing capabilities of a single computer of any size, from small machines to mainframes. When the system load cannot be handled on a single machine, it has been difficult and costly to obtain a larger machine and move the application to the larger machine without downtime. Attempts to distribute work over two or more self-contained machines are often difficult because the software typically has not been written to support distributed computations.
For these reasons, the need for computational clusters has increased. In computational clusters, multiple self-contained nodes are used to collaboratively run applications. Such applications are specifically written to run on clusters from the outset and once written for clusters, applications can run on any configuration of clustered machines from low-end machines to high-end machines and any combination thereof. When demand increases, the demand is easily satisfied by adding more nodes. The newly added nodes can utilize the latest generation of hardware and operating systems without requiring the elimination or upgrading of older nodes. In other words, clusters tend to scale up seamlessly while riding the technology curve represented in new hardware and operating systems. Availability of the overall system is enhanced when cluster applications are written so as not to depend on any single resource in the cluster. As resources are added to or removed from a cluster, applications are dynamically rescheduled to redistribute the workload. Even in the case where a significant portion of the cluster is down for service, the application can continue to run on the remaining portion of the cluster. This continued operation has significant advantages particularly when employed to implement a cluster-based component architecture of the type described in the above-identified cross-referenced application entitled MARKET ENGINES HAVING EXTENDABLE COMPONENT ARCHITECTURE.
While clustering technology shows promise at overcoming problems of existing systems, there exists a need for practical clustering systems. In practical clustering systems, it is undesirable for each application in a cluster system to manage its own resources. First, it is inefficient to have each application solve the same resource management problems. Second, scheduling for conflict resolution and load-balancing (which is important for scalability) is more effectively solved by a common flexible (extensible) resource manager that solves the common problem once, instead of solving the problem specifically for each application. Furthermore, failure states tend to be complex when each application behaves differently as a result of failures and with such differences, it is almost impossible to model the impact of such failures from application to application running on the cluster. To overcome these problems, commercial and academic projects have arisen with the objective of providing a clustering architecture that provides isolation between physical systems and the applications they execute.
To date, however, proposed clustering architectures are complex and can only handle a limited number of specific system failures. In addition, proposed clustering software does not appropriately scale up across multiple sites. There is a need, therefore, for a simple and elegant clustering architecture that includes fault-tolerance and load-balancing, that is extendable over many computer systems and that has a flexible interface for applications. In such an architecture, the number of failure states needs to be kept low so that extensive testing is possible to render the system more predictable. Hardware as well as software failures need to be detected and resources need to be rescheduled automatically, both locally and remotely. Rescheduling needs to occur when a particular application or resource is in high demand. However, rescheduling should be avoided when unnecessary because rescheduling can degrade application performance. When possible, rescheduling should only occur in response to resource shortages or to avoid near-term anticipated shortages. If the system determines that resource requirements are likely to soon exceed the capacity of a system element, then the software might appropriately reschedule to avoid a sudden near-term crunch. The result of this “anticipatory” rescheduling is avoidance of resource bottlenecks and thereby improvement in overall application performance. The addition and removal of components and resources needs to occur seamlessly in the system.
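The rescheduling policy described above (reschedule on a shortage or an anticipated near-term shortage, otherwise leave the workload alone) might be sketched as follows. The text does not say how a shortage is anticipated, so the linear extrapolation, the function name `should_reschedule` and the `horizon` parameter are all assumptions made for illustration.

```python
def should_reschedule(samples, capacity, horizon=3):
    """Decide whether to reschedule work away from a system element.

    samples  -- recent load measurements at equal intervals, oldest first
    capacity -- the load the element can sustain
    horizon  -- how many intervals ahead to look (assumed parameter)
    """
    if not samples:
        return False
    current = samples[-1]
    # Case 1: an actual shortage already exists.
    if current >= capacity:
        return True
    if len(samples) < 2:
        return False
    # Case 2: anticipated shortage. Assumption: extrapolate the average
    # change per interval linearly over the horizon.
    trend = (samples[-1] - samples[0]) / (len(samples) - 1)
    projected = current + trend * horizon
    # Otherwise avoid rescheduling, since it can degrade performance.
    return projected > capacity


steady = should_reschedule([50, 52, 51], capacity=100)
climbing = should_reschedule([60, 75, 90], capacity=100)
```

Here `steady` is `False` because the load is flat and well below capacity, while `climbing` is `True` because the projected load crosses the capacity within the horizon even though the current load does not.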
In view of the above background, it is an object of the present invention to provide an improved fault-tolerance framework for an extendable computer architecture.
SUMMARY
The present invention is a computer system having a fault-tolerance framework in an extendable computer architecture. The computer system is formed of clusters of nodes where each node includes computer hardware and operating system software for executing jobs that implement the services provided by the computer system. Jobs are distributed across the nodes under control of a hierarchical resource management unit. The resource management unit includes hierarchical monitors that monitor and control the allocation of resources.
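A hierarchy of monitors of this kind might be modeled as in the sketch below. It is only an illustration under stated assumptions: the class names `Element` and `Monitor` are hypothetical, and the policy of restarting a failed element is an assumption suggested by the title of the patent, not a mechanism confirmed by this summary.

```python
class Element:
    """A monitored element, for example a job on a node (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.restarts = 0

    def restart(self):
        # Assumed recovery action: bring the element back and count it.
        self.alive = True
        self.restarts += 1


class Monitor(Element):
    """A monitor is itself an element, so a higher-level monitor can watch it."""

    def __init__(self, name, children=()):
        super().__init__(name)
        self.children = list(children)

    def check(self):
        """Restart failed elements one level down, then recurse into monitors.

        Returns the names of the elements restarted during this pass.
        """
        restarted = []
        for child in self.children:
            if not child.alive:
                child.restart()
                restarted.append(child.name)
            if isinstance(child, Monitor):
                restarted.extend(child.check())
        return restarted


# Two levels: a first-level monitor over jobs, a second-level monitor over it.
job_a, job_b = Element("job-a"), Element("job-b")
first = Monitor("first-level", [job_a, job_b])
second = Monitor("second-level", [first])

job_b.alive = False   # a job fails
first.alive = False   # its monitor fails as well
restarted = second.check()
```

Because `Monitor` subclasses `Element`, further levels can be added by wrapping `second` in another `Monitor` without changing the code; in this run `restarted` lists the first-level monitor followed by the failed job.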
In the resource management unit, a first monitor, at a first level, monitors and allocates elements below the first level. A second monitor, at a second level, monitors and allocates elements at the first level. The framework is extendable from the hierarchy of the first and second levels to higher levels where monitors at higher levels each monitor lower-level elements in a hierarchical tree. If a failure occurs down the hierarchy, a hi
