Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1998-09-24
2001-06-12
Ray, Gopal C. (Department: 2181)
C714S004110, C707S793000, C709S203000
active
06247141
ABSTRACT:
BACKGROUND
The invention relates to fault tolerant server systems, and more particularly to fault tolerant server systems including redundant servers.
High availability of service in a telecommunication system can be achieved by means of fault tolerant computers or distributed system architectures. The use of such redundancy, however, may adversely affect other system properties. For example, the utilization of redundancy on the hardware level increases cost, physical volume, power dissipation, fault rate, and the like. This makes it impractical to use multiple levels of redundancy within a system.
For example, distributed systems can incorporate replication between computers in order to increase robustness. If each of these computers is fault tolerant, costs will multiply. Furthermore, if backup copies are kept in software so that the system can recover from software faults, the cost of the extra memory is multiplied by the cost of the fault tolerant hardware, and again by the number of copies in the distributed system. Thus, in order to keep costs low, it is advisable to avoid the use of multiple levels of redundancy. Since the consequence of such a design choice is that only one level of redundancy will be utilized, that level should be selected so as to cover as many faults and other disturbances as possible.
Disturbances can be caused by hardware faults or software faults. Hardware faults may be characterized as either permanent or temporary. In each case, such faults may be covered by fault-tolerant computers. Given the rapid development of computer hardware, the total number of integrated circuits and/or devices in a system will continue to decrease, and each such integrated circuit and device will continue to improve in reliability. Overall, hardware faults are not a dominant cause of system disturbances today, and will be even less so in the future. Consequently, it will be increasingly difficult to justify having a separate redundancy, namely fault tolerant computers, just to handle potential hardware faults.
The same is not true with respect to software faults. The complexity of software continues to increase, and the requirement for shorter development time prevents this increasingly complex software from being tested in all possible configurations, operation modes, and the like. Better test methods can be expected to fully debug normal cases. For faults that occur only on very special occasions, the so-called “Heisenbugs”, there is no expectation that it will be either possible or economical to perform a full test. Instead, these kinds of faults need to be covered by redundancy within the system.
A loosely coupled replication of processes can cover almost all hardware and software faults, including the temporary faults. As one example, it was reported in I. Lee and R. K. Iyer, “Software Dependability in the Tandem Guardian System,” IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, vol. 21, No. 5, May 1995 that checkpointing (i.e., the copying of a present state to a stand-by computer) and restarting (i.e., starting up execution from a last checkpointed state by, for example, reading a log of the transactions that have occurred since the last checkpoint and then starting to process new ones) covers somewhere between 75% and 96% of the software faults, even though the checkpointing scheme was designed into the system to cover hardware faults. The explanation given in the cited report is that software faults that are not identified during test are subtle and are triggered by very specific conditions. These conditions (e.g., memory state, timing, race conditions, etc.) did not recur in the backup process after it took over; consequently, the software fault did not recur either.
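The checkpoint-and-restart idea described above can be illustrated with a minimal Python sketch. The class and method names (Counter, StandBy, take_checkpoint, record, restart) and the toy counter state are assumptions made for illustration only; they are not taken from the cited report or from the invention.

```python
import copy


class Counter:
    """Toy service whose whole state is a dict of named counters (hypothetical)."""

    def __init__(self, state=None):
        self.state = dict(state or {})

    def apply(self, transaction):
        # A transaction is a (key, amount) pair that increments one counter.
        key, amount = transaction
        self.state[key] = self.state.get(key, 0) + amount
        return self.state[key]


class StandBy:
    """Holds the last checkpointed state plus a log of later transactions."""

    def __init__(self):
        self.checkpoint = {}
        self.log = []

    def take_checkpoint(self, state):
        # Copy the present state and drop the log entries it already covers.
        self.checkpoint = copy.deepcopy(state)
        self.log.clear()

    def record(self, transaction):
        # Log a transaction that occurred after the last checkpoint.
        self.log.append(transaction)

    def restart(self):
        # Start from the last checkpoint, then replay the logged transactions.
        service = Counter(self.checkpoint)
        for transaction in self.log:
            service.apply(transaction)
        return service
```

After a restart, the recovered service holds the same state the failed one had reached, provided every transaction since the last checkpoint was logged.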
A problem with replication in a network is that there are a few services, such as arbitration of central resources, that do not lend themselves to distribution. This type of service must be implemented in one process and needs, for performance reasons, to keep its data on its stack and heap. To achieve redundancy, this type of process must then be replicated within the distributed network. In a high-performance telecommunication control system, this replication must be done with very low overhead and without introducing any extra delays.
SUMMARY
It is therefore an object of the present invention to provide methods and apparatuses for implementing a fault-tolerant client-server system.
In accordance with one aspect of the present invention, the foregoing and other objects are achieved in a fault-tolerant client-server system that comprises a primary server, a backup server and a client. The client sends a request to the primary server. The primary server receives and processes the request, including sending a response to the client, independent of any backup processing being performed by the primary server, wherein the response includes primary server state information. By sending the response independent of backup processing, a higher level of concurrency is achieved, thereby making the system more efficient. The primary server also performs backup processing, including periodically sending the primary server state information to the backup server. The client receives the response from the primary server, and sends the primary server state information from the client to the backup server.
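The message flow in this aspect might be sketched as follows in Python. This is a simplified, single-process illustration under stated assumptions: the class names, the use of direct method calls in place of network messages, the request identifiers, and the upper-casing "service" are all hypothetical and are not specified by the text.

```python
class BackupServer:
    """Stores primary server state information (hypothetical representation)."""

    def __init__(self):
        self.pairs = {}  # request id -> (request, reply)

    def store(self, state_info):
        # Idempotent: the same entry may arrive from a client and from the primary.
        for request_id, request, reply in state_info:
            self.pairs[request_id] = (request, reply)


class PrimaryServer:
    def __init__(self, backup):
        self.backup = backup
        self.unsent = []  # state not yet transferred by backup processing
        self.next_id = 0

    def handle(self, request):
        reply = request.upper()  # stand-in for real request processing
        self.unsent.append((self.next_id, request, reply))
        self.next_id += 1
        # Respond at once, with state information attached; the response
        # does not wait for any backup processing.
        return reply, list(self.unsent)

    def backup_processing(self):
        # Runs separately from request handling, e.g. periodically.
        self.backup.store(self.unsent)
        self.unsent = []


class Client:
    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def call(self, request):
        reply, state_info = self.primary.handle(request)
        # The client forwards the primary's state information to the backup.
        self.backup.store(state_info)
        return reply
```

In this sketch the backup learns of each handled request through the client even before the primary's own backup processing runs, which is why the primary can reply without waiting.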
In another aspect of the invention, the primary server state information includes all request-reply pairs that the primary server has handled since a most recent transmission of primary server state information from the primary server to the backup server.
In yet another aspect of the invention, the primary server stores the primary server state information in storage means. The act of performing backup processing in the primary server may be performed in response to the storage means being filled to a predetermined amount.
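A fill-triggered transfer of this kind could look roughly like the Python sketch below. The PairBuffer class, its fill_threshold parameter, and the send_to_backup callback are hypothetical names chosen for illustration; the text does not prescribe how the storage means or the predetermined amount are represented.

```python
class PairBuffer:
    """Holds request-reply pairs handled since the last transfer to the backup."""

    def __init__(self, fill_threshold, send_to_backup):
        if fill_threshold < 1:
            raise ValueError("fill_threshold must be at least 1")
        self.fill_threshold = fill_threshold  # the "predetermined amount"
        self.send_to_backup = send_to_backup  # callable taking a list of pairs
        self.pairs = []

    def add(self, request, reply):
        """Store one pair; return True if this triggered backup processing."""
        self.pairs.append((request, reply))
        if len(self.pairs) >= self.fill_threshold:
            self.flush()
            return True
        return False

    def flush(self):
        # Send everything stored since the last transfer, then start afresh.
        if self.pairs:
            self.send_to_backup(list(self.pairs))
            self.pairs.clear()
```

For example, with a threshold of three pairs, the first two calls to add only store the pair, and the third call hands all three pairs to the callback and empties the buffer.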
In an alternative embodiment, the act of performing backup processing in the primary server may be performed periodically based on a predetermined time interval.
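The time-based alternative can be sketched in Python as a trigger that is polled from the server's main loop. The IntervalTrigger class, its poll method, and the injectable clock parameter are assumptions made for this illustration; only time.monotonic is a standard library call.

```python
import time


class IntervalTrigger:
    """Runs an action whenever a predetermined time interval has elapsed."""

    def __init__(self, interval_seconds, action, clock=time.monotonic):
        self.interval = interval_seconds
        self.action = action  # e.g. the primary's backup processing
        self.clock = clock    # injectable so the trigger can be tested
        self.last_run = clock()

    def poll(self):
        """Call regularly; return True if the action ran on this call."""
        now = self.clock()
        if now - self.last_run >= self.interval:
            self.action()
            self.last_run = now
            return True
        return False
```

Polling keeps the sketch single-threaded; a real server might instead use a timer or a dedicated thread, which is a design choice outside the scope of the text.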
REFERENCES:
patent: 4879716 (1989-11-01), McNally et al.
patent: 5005122 (1991-04-01), Griffin et al.
patent: 5307481 (1994-04-01), Shimazaki et al.
patent: 5434994 (1995-07-01), Shaheen et al.
patent: 5452448 (1995-09-01), Sakuraba et al.
patent: 5455932 (1995-10-01), Major et al.
patent: 5488716 (1996-01-01), Schneider et al.
patent: 5513314 (1996-04-01), Kandasamy et al.
patent: 5526492 (1996-06-01), Ishida
patent: 5566297 (1996-10-01), Devarakonda et al.
patent: 5581753 (1996-12-01), Terry et al.
patent: 5634052 (1997-05-01), Morris
patent: 5652908 (1997-07-01), Douglas et al.
patent: 5673381 (1997-09-01), Huai et al.
patent: 5696895 (1997-12-01), Hemphill et al.
patent: 5751997 (1998-05-01), Kullick et al.
patent: 5796934 (1998-08-01), Bhanot et al.
patent: 0838758A2 (1998-04-01), None
Murthy Devarakonda, et al., “Server Recovery Using Naturally Replicated State: A Case Study,” IBM Thomas J. Watson Research Center, Yorktown Hts, NY, IEEE Conference on Distributed Computing Systems, pp. 213-220, May 1995.
Kenneth P. Birman, “The Process Group Approach to Reliable Distributed Computing”, Reliable Distributed Computing with the Isis Toolkit, pp. 27-57, (ISBN 0-8186-5342-6), reprinted from Communications of the ACM, Dec. 1993.
Robbert Van Renesse, “Causal Controversy at Le Mont St.-Michel”, Reliable Distributed Computing with the Isis Toolkit, pp. 58-67, (ISBN 0-8186-5342-6), reprinted from ACM Operating Systems Review, Apr. 1993.
Kenneth P. Birman, “Virtual Synchrony Model”, Reliable Distributed Computing with the Isis Toolkit, pp. 101-106, (ISBN 0-8186-5342-6), 1994.
Carlos Almeida, et al., “High Availability in a Real-Time System”, Reliable Distributed Computing with the Isis Toolkit, pp. 167-172, (ISBN 0-8186-5342-6), reprinted from ACM Operating Systems Review, Apr. 1993 and Proceedings of the 5th ACM SIGOPS Workshop, Sep. 1992.
Kenneth P. Birman, et al., “Reliable Communication in the Presence of Failures”, Reliable Distributed Computing with the Isis Toolkit, pp. 176-200, (ISBN 0-8186-5342-6), reprinted from ACM Transaction on
Burns Doane Swecker & Mathis L.L.P.
Ray Gopal C.
Telefonaktiebolaget LM Ericsson (publ)