High availability platform with fast recovery from failure...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C714S055000, C714S011000

active

06651185

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to high availability computer platforms, especially those used for Signalling System 7 (SS7) network management.
BACKGROUND OF THE INVENTION
An exemplary high availability platform wherein the invention may be used is disclosed, for instance, in Hewlett-Packard Journal, August 1997, “High Availability in the HP OpenCall SS7 Platform”.
FIG. 1 illustrates the fault-tolerant mechanism used in this platform. The platform comprises two systems, an active system A and a rescue system B. In practice, systems A and B are two separate computers.
Each system runs a Fault Tolerant Controller (FTC) process which is in charge of managing the vital processes running on the system. System A runs a plurality of active processes P1a, P2a, P3a, ..., while system B runs a plurality of standby processes P1s, P2s, P3s, ..., respectively corresponding to the active processes of system A. The standby processes, although inactive, are periodically synchronized with their corresponding active processes by replication messages, so that they are ready to take over the task of the active processes at any time.
The health of the processes is monitored by the FTC through a heart-beat mechanism. Each process, either active or standby, periodically sends a heart-beat message to the corresponding FTC. If the FTC does not receive such a message within a predetermined amount of time (a preset “time-out”), it will declare the process dead and carry out any necessary action, such as respawning the process.
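The time-out check described above can be sketched as follows. This is only an illustration, not the platform's implementation: the class name `HeartbeatMonitor`, its methods, and the 3-second time-out are assumptions, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical FTC-side record of the last heart-beat of each process."""
    timeout: float                      # seconds of silence before a process is suspected dead
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, name: str, now: float) -> None:
        # Called whenever a heart-beat message arrives from process `name`.
        self.last_seen[name] = now

    def silent(self, now: float) -> list:
        # Processes whose last heart-beat is older than the time-out.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)   # assumed value, for illustration only
monitor.heartbeat("P1a", now=0.0)
monitor.heartbeat("P2a", now=0.0)
monitor.heartbeat("P2a", now=2.0)         # P2a keeps beating, P1a goes mute
suspects = monitor.silent(now=4.0)        # P1a has been silent for 4 s, above the 3 s time-out
```

In this sketch a silent process is only a suspect; the action taken (such as respawning) is left to the caller.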
In fact, in order to obtain a higher degree of confidence, a process is declared dead only if the FTC also receives heart-beat failure information through a second path. This second path is established through the other system, via heart-beat messages sent between the active processes and their respective standby processes and between the FTCs running on the two systems.
For instance, if process P1a dies, the FTC of system A will detect the absence of a heart-beat, and so will process P1s. Process P1s then sends the FTC of system B, within a heart-beat message, information indicating that process P1a may be dead. The FTC of system B passes this information, also within a heart-beat message, to the FTC of system A, which can thus double-check the fact that process P1a is dead.
If there is a contradiction between the heart-beat information obtained directly from process P1a and indirectly from process P1s, the FTC may take other measures to ensure process P1a is in good health, such as explicitly killing it and respawning it.
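The two-path decision can be summarised in a small sketch. The names `Verdict` and `assess` are hypothetical, and the mapping of a contradiction to a separate outcome is one reading of the text above, not a statement of how the platform encodes it.

```python
from enum import Enum


class Verdict(Enum):
    ALIVE = "alive"
    DEAD = "dead"
    CONTRADICTION = "contradiction"   # the two paths disagree


def assess(direct_missing: bool, peer_reports_missing: bool) -> Verdict:
    """Combine the direct heart-beat path with the path through the other system.

    direct_missing:       the local FTC saw no heart-beat from the process.
    peer_reports_missing: the standby process, via the other FTC, reports the
                          same process as possibly dead.
    """
    if direct_missing and peer_reports_missing:
        return Verdict.DEAD            # both paths agree: declare the process dead
    if direct_missing != peer_reports_missing:
        return Verdict.CONTRADICTION   # e.g. kill and respawn explicitly
    return Verdict.ALIVE


both_paths = assess(direct_missing=True, peer_reports_missing=True)
one_path = assess(direct_missing=True, peer_reports_missing=False)
```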
The heart-beat period and the time-out should be chosen such that a dead process is respawned within an acceptable period of time. The time-out after which an FTC declares a mute process dead is chosen to be slightly greater than the longest atomic operation a process may have to carry out. An atomic operation, in a multi-tasking system, is an operation that cannot be interrupted, for instance to switch between two concurrent tasks. The sending of a heart-beat message by a process requires an interruption of the process, for instance by a timer. If a process is carrying out an atomic operation when it receives an interruption request, the process will only respond to the interruption request at the end of the atomic operation.
In the above exemplary OpenCall SS7 platform, important processes, such as the SS7 protocol stack, should not be unavailable (dead) for more than a relatively short period of time, targeted for instance at 6 seconds. This means the time-out period must in principle be smaller than 6 seconds and the heart-beat period even smaller. Moreover, an FTC on one system will declare a process dead and respawn it, in practice, only after receiving a confirmation from the FTC on the other system. Such confirmation is delayed by twice the message transport overhead between systems A and B, and will be expected within a second time-out period. This puts additional constraints on the time parameters, to the extent that it may not be possible to satisfy the targeted 6 seconds for respawning the protocol stack. It is also difficult to increase the heart-beat frequency in order to relax other constraints, because the processing of the heart-beat messages would become excessively CPU-time consuming. A value of 2 seconds for the heart-beat period is a tradeoff between low CPU-time consumption and short reaction time.
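A rough worked example shows how quickly the budget is consumed. Only the 6-second target and the 2-second heart-beat period come from the text; the time-out, confirmation time-out and respawn time below are assumed figures, and the additive model (a process dying just after a heart-beat) is a simplification.

```python
def worst_case_outage(timeout: float, confirm_timeout: float, respawn_time: float) -> float:
    """Simplified worst case: the process dies right after sending a heart-beat,
    the FTC waits the full time-out, then the full confirmation time-out,
    and only then respawns the process."""
    return timeout + confirm_timeout + respawn_time


TARGET = 6.0             # seconds, from the text
HEARTBEAT_PERIOD = 2.0   # seconds, from the text

timeout = 3.0            # assumed: somewhat above the heart-beat period
confirm_timeout = 2.0    # assumed: covers twice the transport overhead between A and B
respawn_time = 1.5       # assumed: time to restart the protocol stack

outage = worst_case_outage(timeout, confirm_timeout, respawn_time)
within_target = outage <= TARGET
```

With these assumed figures the worst case exceeds the target, which illustrates why the confirmation step makes the 6-second goal hard to meet.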
Moreover, several processes likely to run on the above platform may carry out atomic operations which take a long time with respect to the heart-beat period. One such process is a database manager which responds to database queries in a short time, but which must periodically carry out a database update in an atomic operation. Such an update takes a time depending on the size of the database, and may be on the order of a minute.
Typically, processes like the database manager, which would require the time-out to be set above one minute, are not monitored, so that the time-out can be set to a value compatible with important processes that must be respawned quickly, such as the SS7 protocol stack. However, problems then arise if such unmonitored processes die unexpectedly.
SUMMARY OF THE INVENTION
The present invention is directed in general to providing the above platform with a quick respawning capability of important processes.
One difficulty to overcome is that of monitoring processes requiring long atomic operations, without disabling the quick respawning capability of other processes.
Another difficulty is to satisfy targeted respawning times of certain processes.
These difficulties are overcome in a high availability platform arranged, in operation, to run a fault-tolerant controller process (FTC) and at least one monitored process arranged to indicate its live state by periodically sending a heart-beat message to the FTC. The FTC is arranged to respond to the heart-beat message by modifying the frequency at which it expects the heart-beat message according to information contained therein.
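One way to picture this arrangement is a monitor whose expectation for the next heart-beat is set by the heart-beat itself. The sketch below is an assumption-laden illustration: the class `AdaptiveMonitor` and the `next_within` field are hypothetical names, and the summary above does not specify the message format.

```python
class AdaptiveMonitor:
    """Hypothetical FTC-side monitor with a per-process, per-message deadline."""

    def __init__(self, default_timeout: float):
        self.default_timeout = default_timeout
        self.deadline = {}

    def heartbeat(self, name: str, now: float, next_within: float = None) -> None:
        # `next_within` stands for information carried in the heart-beat message:
        # how long the FTC should wait for the next one. If absent, the default applies.
        window = self.default_timeout if next_within is None else next_within
        self.deadline[name] = now + window

    def overdue(self, now: float) -> list:
        return sorted(n for n, d in self.deadline.items() if now > d)


ftc = AdaptiveMonitor(default_timeout=3.0)          # assumed default
ftc.heartbeat("ss7_stack", now=0.0)                 # short default window
ftc.heartbeat("db_manager", now=0.0, next_within=90.0)  # about to start a long atomic update
late_at_10s = ftc.overdue(now=10.0)
late_at_100s = ftc.overdue(now=100.0)
```

In this sketch the database manager can announce a long atomic update without forcing a long time-out on the protocol stack.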
The platform may be arranged to run an additional process, the monitored process being arranged to regularly send the additional process a message and to notify the FTC that the additional process is dead when it receives an error code from an operating system after sending a message to the additional process.
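The idea of using an operating system error as a death notice can be sketched with a POSIX pipe standing in for the message channel. The function `send_and_check` and the `notify_ftc` callback are hypothetical; closing the read end merely simulates the death of the additional process.

```python
import os


def send_and_check(write_fd: int, payload: bytes, notify_ftc) -> bool:
    """Send a message; if the operating system returns an error, tell the FTC."""
    try:
        os.write(write_fd, payload)
        return True
    except OSError as err:
        # On POSIX systems, writing to a pipe with no reader fails (typically EPIPE).
        notify_ftc(err.errno)
        return False


reports = []                              # stands in for notifications sent to the FTC
read_fd, write_fd = os.pipe()
ok_while_alive = send_and_check(write_fd, b"ping", reports.append)
os.close(read_fd)                         # simulate the additional process dying
ok_after_death = send_and_check(write_fd, b"ping", reports.append)
os.close(write_fd)
```

The additional process thus needs no heart-beat of its own; its death is noticed as a side effect of normal messaging.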
The platform may comprise two systems, each arranged, in operation, to run an FTC and associated monitored processes, a plurality of processes running on at least a first system being in standby mode and corresponding to respective active processes on the second system, such that, if the second system fails, the standby processes of the first system become active and take over the tasks of the processes that were active on the second system. When it is necessary to force the shutdown of one system, the FTC of this system is arranged to send the processes a switch-over signal, causing said active processes to die and the respective standby processes on the other system to become active through a transition phase in which the processes do not perform input/output operations.
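The switch-over sequence can be sketched as a small state machine. The state names, the `Process` class and the `switch_over` function are assumptions made for illustration; the text only states that standby processes pass through a transition phase without input/output.

```python
from enum import Enum, auto


class State(Enum):
    ACTIVE = auto()
    STANDBY = auto()
    TRANSITION = auto()   # becoming active; no input/output allowed
    DEAD = auto()


class Process:
    def __init__(self, name: str, state: State):
        self.name = name
        self.state = state

    def may_do_io(self) -> bool:
        # In this sketch only fully active processes perform input/output.
        return self.state is State.ACTIVE


def switch_over(active: list, standby: list) -> list:
    """Apply a switch-over and return whether I/O was allowed during the transition."""
    io_during_transition = []
    for proc in active:
        proc.state = State.DEAD            # active processes die on the signal
    for proc in standby:
        proc.state = State.TRANSITION
        io_during_transition.append(proc.may_do_io())
    for proc in standby:
        proc.state = State.ACTIVE          # transition complete
    return io_during_transition


system_a = [Process("P1a", State.ACTIVE), Process("P2a", State.ACTIVE)]
system_b = [Process("P1s", State.STANDBY), Process("P2s", State.STANDBY)]
io_flags = switch_over(system_a, system_b)
```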


REFERENCES:
patent: 5737515 (1998-04-01), Matena
patent: 5978933 (1999-11-01), Wyld et al.
patent: 6023772 (2000-02-01), Fleming
patent: 6148415 (2000-11-01), Kobayashi et al.
patent: 6272113 (2001-08-01), McIntyre et al.
patent: 6370656 (2002-04-01), Olarig et al.
patent: 6477663 (2002-11-01), Laranjeira et al.
Wyld, Brian C. et al., “High Availability in the HP OpenCall SS7 Platform”, Hewlett-Packard Journal, pp. 65-71 (Aug. 1997).
