Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1998-11-19
2002-04-09
Iqbal, Nadeem (Department: 2184)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C709S201000
Reexamination Certificate
active
06370656
ABSTRACT:
CROSS-REFERENCE TO RELATED APPLICATIONS
Not applicable.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
Not applicable.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to fault tolerance in microcomputer systems, and in particular to computer systems adapted to periodically check for failures. More particularly, the present invention relates to personal computer system capable of transmitting and receiving heartbeat messages at an adjustable rate for improved fault tolerance.
2. Background of the Invention
Although early microcomputers were popular with hobbyists for such computing tasks such as word processing and video games, early microcomputer systems did not match the superior data processing speed of larger mainframes and minicomputers. Consequently, most businesses and organizations that required a high level of data processing and communications, including financial, academic, and scientific institutions, traditionally relied on networks of mainframes and minicomputers for computing tasks. In recent years, microcomputers, which may be generally defined as microprocessor-based, programmable electronic devices for retrieving, storing, and processing data, have developed rapidly in terms of processor speed, memory speed and capacity, and interconnectability. As microcomputing capabilities approach those of mainframes and minicomputers, networks of personal computer systems increasingly are utilized for the heavy data processing and communications jobs once handled by the larger machines.
Because of the sheer amount of data that must be processed by some organizations (e.g., financial and research institutions) and also the sensitivity of some data to computer system faults (such as air traffic control data and banking transactions), mainframe computers usually have incorporated measures to ensure fault tolerance, or the capability of a computer system or network of computers to continue operating even if an internal hardware or software failure occurs. Hence, fault tolerant systems are designed to operate essentially without interruptions. One method of providing fault tolerance is to combine a primary computer system with a backup system. A backup system generally waits in a standby mode without processing data until the primary system fails. When the primary system fails, the backup system replaces the primary system. The calculations of the primary system can thus be continued by the backup system, albeit with a slight interruption before the backup system is activated. Another fault tolerance scheme involves combining two “redundant” computer systems which process the same data concurrently. If one of the systems fails, then the data may still be processed by the working system. A major drawback to redundant systems is their significant expense, due to the fact that two or more data processing systems are required instead of just one. In one type of hybrid system, two or more computers operate independently, processing different data but attached to a common network. When a computer fails, the failed machine is disabled and the remaining computers on the network embrace the workload of the failed computer.
Because the cost of a typical microcomputer (or “personal computer”) has remained well below the cost of a typical mainframe even as personal computing capabilities have soared, it has become increasingly cost effective to use personal computer (PC) systems for tasks that were once reserved only for mainframes. In addition, PC manufacturers have encouraged using personal computers for these tasks by introducing fault tolerance mechanisms into some recent computer designs. Fault tolerant PC networks have been introduced, as well. Personal computer networks generally include one or more personal computers configured as network servers which manage the network and the transfer and storage of data within the network. Network servers generally comprise an abundance of resources, including one or more very fast processors, a large amount of random access memory (RAM), and an abundance of disk storage space. Further, network servers typically operate at fast input/output (I/O) speeds and are given more frequent access to the network than are other computers on the network. The abundance of resources and increased network access allow each network server to transfer files and data efficiently to a large number of networked computers. Because a single failure in a network server may cause network problems or even downtime to many computer users, fault tolerant network servers generally have benefited network performance and have helped to minimize network downtime.
In one network fault tolerance scheme, two servers operate independently of each other but are capable of handling an increased workload if one of the servers fails. In such a scheme, a first server periodically transmits a “heartbeat” message over the network to a second server to indicate that the first server is functioning properly. If the second server does not receive the heartbeat message within a predetermined time interval, then the second server concludes that the first server has failed and seizes the workload of the first server. The second server also transmits a periodic heartbeat message to the first server, so that the first server may process data in place of the second server if the second server fails. Thus, each server essentially provides backup support for the other server in case of a server failure. The heartbeats typically are transmitted infrequently in order to minimize the level of network traffic.
One problem with the heartbeat scheme is that because the heartbeat messages are transmitted at fixed time intervals (or “heartbeat periods”), the heartbeat scheme may be unsuitable for networks which cannot permit downtime greater than one heartbeat period. For instance, if one server fails immediately after transmitting a heartbeat, then it will take almost one full heartbeat period before the second server detects and corrects for the failure. In some sensitive networks, such excessive downtime conceivably could severely degrade network service, cause network instability, or even result in human catastrophe if the network is involved in transportation or safety systems. Conversely, systems needing only a moderate level of fault tolerance might not require a frequent heartbeat. Because all messages sent over a network require some amount of network capacity (or “bandwidth”), a network server transmitting heartbeats at a high rate may absorb large amounts of network bandwidth. Thus, the optimum heart rate may vary according to the type of information being processed and the processing speed. Because it is difficult to design a one-size-fits-all heartbeat scheme, such methods often are not well-suited for a wide range of user applications.
While conventional heartbeat schemes are capable of monitoring whether or not a computer system has failed, these methods usually do not help to predict when failures might occur. If computer failures could be predicted before happening, then corrective actions could be taken as soon as possible to prevent or minimize system downtime. Current heartbeat schemes fail to incorporate prediction measures, however.
Thus, there remains a need for a flexible and responsive fault tolerance scheme capable of determining as well as predicting system performance. Such a scheme preferably would be able to intelligently optimize the heart rate to improve response time during a system failure. Despite the apparent advantages of such a system, to date no one has devised a computer system that offers these benefits.
SUMMARY OF THE INVENTION
Accordingly, the present invention discloses a computer system comprising two central processing units (CPUs), a bridge logic device coupled to the CPUs, and a network interface card (NIC) coupled to the bridge logic, each device transmitting variable-rate heartbeats to a heartbeat monitor. The computer system further includes a main memory device coupled to the bridge logic.
Jenne John E.
Olarig Sompong P.
Compaq Information Technologies, Group L. P.
Conley & Rose & Tayon P.C.
Harris Jonathan M.
Heim Michael F.
Iqbal Nadeem
LandOfFree
Computer system with adaptive heartbeat does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Computer system with adaptive heartbeat, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Computer system with adaptive heartbeat will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2899435