Enhanced instrumentation software in fault tolerant systems

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C714S047300, C714S015000, C714S025000

active

06360338

ABSTRACT:

NOTICE REGARDING COPYRIGHTED MATERIAL
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office file or records, but otherwise reserves all copyright rights whatsoever.
BACKGROUND OF THE INVENTION
This invention deals generally with software in fault tolerant systems and specifically with fault tolerant instrumentation software for monitoring multiple processes in a distributed multi-processing network.
Monitoring and control of data plays an important role in today's computer systems. Especially where large computer systems deal with large amounts of information, as in, for example, a distributed transaction-based data base system, the ability to receive information from any of a number of processes that make up the data base service and the ability to control or otherwise affect the operation of the service processes has advantages. One advantage is that the system can be selectively monitored by a human or an automated management system such as another computer system. Another advantage is that the operation of the system can be affected in real time without bringing the system to a halt to load in and execute modified software to implement the services or processes.
Monitoring and control of software in real time is also referred to as “instrumenting” the software being executed.
FIG. 1 shows a generalized computer network 10 that includes several processors such as processor 12, processor 14, etc. Each processor typically includes a central processing unit (CPU), random access memory (RAM), disk drive, etc. In the generalized computer network of FIG. 1, the processors may be any type of processor or computer system as is commonly known in the art. The processors typically execute software to perform tasks. The software can be thought of in terms of singular “processes,” which are shown as circles within the processor rectangles, such as process 22 within processor 16. A process such as process 22 may be an operating system process, application program process, etc. and can perform tasks such as math computations, data base manipulation, communication tasks, etc. In today's distributed networks, processes can be split up over several processors so that multi-processing takes place. For example, process 22 can be part of a graphics-rendering task in which processes 24, 26 and 28 are also participating. Thus, in a distributed multi-processor network, it is often irrelevant where a certain process is executing.
Processes can communicate with other processes by sending messages over the network. For example, in FIG. 1, message 30 is being transferred over network 32 from process 22 to process 28. The processes reside, respectively, on processor 16 and processor 20. Message 30 may be, for example, a packet of data if the generalized network 10 is a packet switch network.
In FIG. 1, network 32 may be any type of network. Further, the interconnections between processors may be by hardwire, radiowave, fiber optic, or other types of connections. The ability of processes on different processors to communicate quickly and efficiently over network 32 is very important toward realizing an efficient distributed network.
A processor, such as processor 20 in FIG. 1, may have specific hardware attached to it to perform tasks such as interfacing with a human. Processor 20 is shown to have a display 32 and keyboard 34 for performing, respectively, output and input to a human user. Such devices are useful, for example, to allow a human to monitor and control whatever tasks are being performed by the various processors and processes attached to network 32. One example of a task or “service” is a distributed data base system where multiple users at multiple processors can be connected to multiple other processors for purposes of accessing a data base that resides on storage media connected to the network. In FIG. 1, it is assumed that each processor has some of its own resources, such as RAM and other storage media. However, typically a network will provide shared resources such as a large disk array that can be accessed by any of the processors in turn.
Where processor 20 is executing a process, such as process 28, to implement a monitoring and control function so that a user operating keyboard 34 and viewing display 32 can receive information on, and transfer information to, various processes in the network, it is, naturally, important that the monitoring and control function be accurate and reliable. In traditional systems, it is a simple matter to ensure that monitoring and control is implemented reliably if it is acceptable for a failure of one or more of the components in generalized network 10 to cause a halt in the monitoring and/or control activity.
For example, assume process 28 is monitoring process 22 so that process 28 receives information from process 22 in the form of messages such as message 30 sent, from time to time, from process 22 to process 28. Under normal operation, process 28 would receive messages containing information on the state or status of process 22 and display this information to a user on display 32. Also, messages can be transferred in the other direction, from process 28 to process 22, in response to a user's input at keyboard 34. The messages from the monitoring and control process 28 to the monitored and controlled process 22 could change the way process 22 operates.
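The two-way exchange described above can be sketched in a few lines of Python. This is only an illustrative, single-machine sketch and not the disclosed implementation: the class names, the "status" and "control" message kinds, and the "mode" setting are all hypothetical, and in-memory queues stand in for network 32.

```python
from dataclasses import dataclass
from queue import Queue


@dataclass
class Message:
    sender: int    # reference numeral of the sending process, e.g. 22
    receiver: int  # reference numeral of the receiving process, e.g. 28
    kind: str      # "status" or "control" (hypothetical message kinds)
    payload: dict


class MonitoredProcess:
    """Stands in for process 22: reports its state and obeys control messages."""

    def __init__(self, pid):
        self.pid = pid
        self.inbox = Queue()
        self.state = {"mode": "normal"}  # hypothetical state

    def report_status(self, monitor):
        # Send a copy of the current state to the monitoring process.
        monitor.inbox.put(Message(self.pid, monitor.pid, "status", dict(self.state)))

    def handle_pending(self):
        # Apply every queued control message to the local state.
        while not self.inbox.empty():
            msg = self.inbox.get()
            if msg.kind == "control":
                self.state.update(msg.payload)


class MonitorProcess:
    """Stands in for process 28: shows status and forwards user input."""

    def __init__(self, pid):
        self.pid = pid
        self.inbox = Queue()
        self.display = []  # stands in for display 32

    def handle_pending(self):
        # Record every queued status message on the "display".
        while not self.inbox.empty():
            msg = self.inbox.get()
            if msg.kind == "status":
                self.display.append((msg.sender, msg.payload))

    def send_control(self, target, **changes):
        # A user's keyboard input would be translated into `changes`.
        target.inbox.put(Message(self.pid, target.pid, "control", changes))


def demo():
    p22, p28 = MonitoredProcess(22), MonitorProcess(28)
    p22.report_status(p28)                   # status from 22 to 28
    p28.handle_pending()
    p28.send_control(p22, mode="throttled")  # control from 28 to 22
    p22.handle_pending()
    p22.report_status(p28)                   # status now reflects the change
    p28.handle_pending()
    return p28.display
```

Calling `demo()` returns the two status reports seen by the monitor, the second of which reflects the control message. Note that nothing in this sketch survives the loss of either process.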
If a failure occurs, such as processor 16 being rendered inoperable, process 22 would cease to transmit messages and would also cease to receive and act upon messages. Where such a failure is not catastrophic to the operation of the network, or to the service provided by the network system, the failure of processor 16, and the inability of process 22 to communicate, would eventually be detected. Once detected, process 28 could simply be directed to cease communications with process 22. Alternatively, another process could be launched on a different processor to duplicate the task formerly performed by process 22. Then, process 28 could resume communications with the substitute process. However, note that this might mean messages have been lost between process 28 and process 22, since processor 16 may have failed after process 28 had sent a message and before process 22 had received it. Also, the failure of processor 16 may mean that a message that should have been generated by process 22 and transmitted to process 28 was never generated or received by process 28. In systems where fault tolerance is not important, this is not a problem. However, a problem arises in distributed processing in network systems that are performing services where loss of communications and other data faults are not acceptable. An example of a system where fault tolerance is required is transaction processing in a data base system where the transactions are financial.
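The window in which messages are lost can be made concrete with a small simulation. The text does not say how the failure of processor 16 is detected; the sketch below assumes, purely for illustration, a timeout on status messages measured in logical ticks. `FailureDetector`, `simulate`, and their parameters are hypothetical names.

```python
class FailureDetector:
    """Suspects a process once no status message has arrived for `timeout` ticks.

    Timeout-based detection is an assumption made for this sketch.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def record(self, pid, now):
        self.last_seen[pid] = now

    def suspected(self, now):
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)


def simulate(fail_at, timeout, ticks):
    """Process 22 reports every tick until its processor fails at `fail_at`.

    Process 28 sends one control message per tick until it detects the
    failure. Returns the detection tick and the ticks whose control
    messages were sent to the already-failed process and therefore lost.
    """
    detector = FailureDetector(timeout)
    detected_at = None
    lost = []
    for now in range(ticks):
        alive = now < fail_at
        if alive:
            detector.record(22, now)  # status message from 22 reaches 28
        if detected_at is None and 22 in detector.suspected(now):
            detected_at = now         # process 28 stops talking to 22
        if detected_at is None and not alive:
            lost.append(now)          # sent by 28, never received by 22
    return detected_at, lost
```

With `simulate(fail_at=3, timeout=2, ticks=8)`, the failure is detected at tick 5 and the control messages sent at ticks 3 and 4 are lost. A shorter timeout narrows this window but cannot close it, which is the gap a fault tolerant design has to address.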
Therefore, it is desirable to have a system that monitors and controls a software service while providing fault tolerance.
SUMMARY OF THE INVENTION
A first aspect of the invention discloses a method for providing fault tolerant monitoring and control in a distributed processing network. The network includes a plurality of computer systems executing a plurality of service processes that cooperatively perform a function. Monitored processes and exporter processes exchange messages.
An exporter process sends messages to a monitored process about the state of one or more service processes. The exporter process receives messages from the monitored process and transfers information to one or more controlled service processes. The method includes the steps of: receiving with the monitored process, a message that a first process is disabled; in response to the receiving step, performing the following ste
