Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2001-01-19
2004-04-06
Iqbal, Nadeem (Department: 2184)
C714S026000, C714S048000, C714S057000
active
06718482
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to a computer system and a method of monitoring faults occurring in the computer system and more particularly, to a fault monitoring system for monitoring a fault when the fault takes place in software.
Conventional techniques for monitoring faults in a computer such as a personal computer are disclosed in, for example, JP-A-9-50386, JP-A-5-250284 and JP-A-5-257914.
In these techniques, the computer to be monitored is fitted with an optional board dedicated to fault monitoring, which carries a processor independent of the processor in the main body of the computer. The optional board monitors the state of the hardware in the computer main body to detect hardware faults and, in addition, communicates periodically with a monitor program running on the computer to detect software faults.
When the optional board detects a fault, it notifies a different computer connected through a network of the fault, using a communication mechanism of the optional board or of the computer. The computer connected through the network can then control the power supply of the monitored computer (switching it on or off) and reboot it.
For remote control of computers, the computer to be controlled must be controlled through the network. Typically, a control request entered through the network is transmitted to software running on the computer to be controlled, and that software receives the request and executes the corresponding process.
Such remote control, however, presupposes that the software running on the computer to be controlled is operating normally. Accordingly, when a fault occurs in that software, remote control may become impossible. In particular, when the operating system (OS) becomes faulty, communication through the network may itself fail. This disadvantage is critical when fault monitoring of a computer at a remote location is performed from another computer connected through the network.
In the technique disclosed in JP-A-9-50386, an optional board for fault monitoring communicates periodically with software running on the monitored computer, so that a software fault can be detected from the presence or absence of a response. When a fault is detected, another computer is notified of it by means of the communication function of the optional board. According to this technique, fault notification and control of the computer from a remote location are possible even when a fault occurs in the monitored computer.
The technique disclosed in JP-A-9-50386, however, has the following problems.
(1) When a software fault occurs, software information, such as information on the state of the software running on the computer main body or information managed and held by that software, cannot be collected.
(2) Since the optional board has a communication function that operates independently of the computer main body, only communication programs that use a network protocol supported by the optional board can be used, and the functions that can be implemented are limited.
(3) Communication between the optional board and the monitored computer during a fault requires a program running on the optional board, but the optional board has fewer resources, such as memory, than the computer main body, and the functions that can be implemented are limited.
Problems (2) and (3) above can be solved by implementing a plurality of network protocols in the optional board or by adding resources to the optional board itself. Even then, however, the development and production costs of the optional board increase.
SUMMARY OF THE INVENTION
An object of the present invention is to provide a fault monitoring system which, even when a fault occurs in a computer, can control the computer by a request command from a different computer connected to the computer through a network.
Another object of the invention is to provide a fault monitoring system which can transmit fault information to the different computer connected through the network even when a software fault takes place in the computer representing an object to be monitored.
Still another object of the invention is to relieve the functional limitations caused by a shortage of computer resources in the monitored computer.
To accomplish the above objects, according to the present invention, a computer representing an object to be monitored (a monitored computer) is connected to a computer for monitoring the monitored computer (a monitoring computer) through a network.
In a preferred embodiment of the invention, the monitored computer includes a multi-OS controller for running a plurality of OSs on the single computer. Two software environments are formed on the monitored computer: a first software environment, which is constructed by a first OS and is the object to be monitored, and a second software environment, which is constructed by a second OS and is independent of the first software environment.
Communicating means for communicating with the different computer through the network and a fault monitor agent for monitoring the occurrence of software faults in the first software environment operate in the second software environment. When it detects a fault in the first software environment, the fault monitor agent notifies the monitoring computer of the fault. On receiving the notification, the monitoring computer communicates with the fault monitor agent and commands it to control the monitored computer. In response to the command from the monitoring computer, the fault monitor agent controls the monitored computer.
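The detect, notify, command and control sequence described above might be sketched as follows. This is only an illustration under stated assumptions: the class name, the callback-based `notify` hook and the command names are hypothetical and do not come from the patent, which does not specify an implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FaultMonitorAgent:
    """Agent assumed to run in the second software environment (illustrative only)."""

    notify: Callable[[str], None]             # hypothetical hook: sends a notice to the monitoring computer
    controls: Dict[str, Callable[[], None]]   # hypothetical mapping: command name -> control action
    log: List[str] = field(default_factory=list)
    fault_reported: bool = False

    def on_fault_detected(self, detail: str) -> None:
        """Notify the monitoring computer, at most once per fault."""
        if not self.fault_reported:
            self.fault_reported = True
            self.notify(f"fault in first software environment: {detail}")

    def on_command(self, command: str) -> bool:
        """Carry out a command from the monitoring computer; return False if it is unknown."""
        action = self.controls.get(command)
        if action is None:
            self.log.append(f"rejected unknown command: {command}")
            return False
        action()
        self.log.append(f"executed: {command}")
        return True
```

Because the agent is assumed to live in the second software environment, it can still receive and carry out a command such as a reboot after the first software environment has stopped responding.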
In an embodiment of the invention, the fault notification is sent, and the control operation of the monitored computer is commanded, by electronic mail (e-mail).
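One way e-mail could carry both the notification and the command is sketched below with Python's standard `email` package. The subject-line formats (`FAULT <host>` and `COMMAND <name>`), the command set and the function names are assumptions made for illustration; the patent does not define a message format.

```python
from email import message_from_string
from email.message import EmailMessage

# Hypothetical command set; the patent does not enumerate commands.
ALLOWED_COMMANDS = {"REBOOT", "POWER_OFF", "COLLECT_INFO"}


def build_fault_mail(sender, recipient, host, detail):
    """Build a fault notification addressed to the monitoring computer."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"FAULT {host}"   # assumed subject convention
    msg.set_content(detail)
    return msg


def parse_command_mail(raw_text):
    """Return the command named in the Subject header, or None if it is not allowed."""
    msg = message_from_string(raw_text)
    parts = (msg["Subject"] or "").split()
    if len(parts) == 2 and parts[0] == "COMMAND" and parts[1] in ALLOWED_COMMANDS:
        return parts[1]
    return None
```

The sketch only builds and parses messages. Sending, receiving and authenticating the sender are left out, and a practical system would need to verify where a command came from before acting on it.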
The fault monitor agent detects a fault in the first software environment by monitoring alive messages sent by another fault monitor agent running in the first software environment.
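Alive-message monitoring of this kind is commonly implemented as a timeout on the most recent message. The following is a minimal sketch under that assumption; the class name, the timeout value and the injectable clock are hypothetical, and the patent does not state how the interval is chosen.

```python
import time


class AliveMonitor:
    """Tracks alive messages from the first software environment (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s    # assumed silence threshold, in seconds
        self.clock = clock            # injectable clock so the logic can be tested
        self.last_alive = clock()     # start-up counts as the first alive message

    def on_alive_message(self):
        """Record that an alive message arrived from the monitored environment."""
        self.last_alive = self.clock()

    def fault_detected(self):
        """Return True when no alive message has arrived within the timeout."""
        return self.clock() - self.last_alive > self.timeout_s
```

In use, the agent in the second software environment would call `on_alive_message()` whenever a message arrives and poll `fault_detected()` periodically. A monotonic clock is used so that adjustments to the wall-clock time do not cause false alarms.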
REFERENCES:
patent: 5721922 (1998-02-01), Dingwall
patent: 5787409 (1998-07-01), Seiffert et al.
patent: 5805790 (1998-09-01), Nota et al.
patent: 6477667 (2002-11-01), Levi et al.
patent: 6615376 (2003-09-01), Olin et al.
patent: 2002/0120884 (2002-08-01), Nakamikawa et al.
patent: 2002/0129305 (2002-09-01), Ahrens et al.
patent: 2002/0188895 (2002-12-01), Quach et al.
patent: 2003/0097422 (2003-05-01), Richards et al.
patent: 5-250284 (1993-09-01), None
patent: 5-257914 (1993-10-01), None
patent: 9-50386 (1997-02-01), None
Kimura, Shinji. High-reliability and High-availability DARMA Nanokernel. Hitachi-SDL. pp. 1-10.*
“Modern Operating Systems”, Prentice Hall, 1992, Andrew S. Tanenbaum, pp. 21-22 & 637-641.
Arai Toshiaki
Kimura Shinji
Sato Masahide
Umezu Toshikazu
Hitachi, Ltd.
Iqbal Nadeem
Mattingly Stanger & Malur, P.C.
Wilson Yolanda L