Operating system hang detection and correction

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C714S023000, C714S055000, C713S002000

active

06587966

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Technical Field of the Invention
The present invention relates in general to the field of computer systems, and in particular, by way of example but not limitation, to operating system hang detection and correction within a computer system.
2. Description of Related Art
Despite advances in computer and operating system architectures, computer systems continue to be vulnerable to operating system hang conditions from time to time. The primary cause of this vulnerability is that operating system hang conditions may occur for a wide variety of reasons that are difficult to predict and even more difficult to completely avoid. For example, an operating system hang condition may occur due to insufficient system resources, incompatible use of the available resources, incompatible device drivers or errors in the operating system or application software. Furthermore, the particular configuration of the system may produce an operational state that the operating system was not originally designed to handle, or may continuously generate an event, such as a hardware or software interrupt, that the operating system cannot clear using available methods. As a result, the operating system may enter a continuous loop or an unknown operational state from which the operating system cannot recover without some form of intervention, such as a system reset.
The problems associated with operating system hang conditions are exacerbated by the fact that the user typically cannot distinguish between an operating system hang condition and an unusual processing delay. As a result, the user may experience uncertainty and/or frustration in attempting to determine whether an operating system hang condition has occurred. Inexperienced users, for example, may wait an inordinate amount of time for the computer system to respond to user input, unaware that the operating system is no longer functioning. Furthermore, in applications where the computer system is intended to continuously operate for long periods of time, such as Web servers, file servers, database servers or network servers, the failure to detect an operating system hang condition can become especially problematic. Because these computer systems typically perform tasks that are critical to an organization's business operations, system downtime caused by the failure to detect an operating system hang condition may be unacceptable.
Existing approaches for detecting operating system hang conditions have proven to be inadequate or unreliable in that they typically rely on user observation of system activity. For example, the user may attempt to determine whether an operating system hang condition has occurred by monitoring for hard drive activity, by testing for pointer and/or keyboard responsiveness or by actuating the “NumLock” key to determine if the key's associated LED changes state. These approaches, however, have limited reliability in that they rely upon the subjective judgment of the individual user which varies significantly depending upon the user's level of experience. Furthermore, these approaches require the presence of a human operator to perform physical observations of system activity and to perform a system reset in the event an operating system hang condition is detected. As a result, these approaches may be completely inadequate for detecting operating system hang conditions in server-based applications where the physical observation of system activity by a human operator may not be a viable or cost effective option.
Therefore, in light of the deficiencies of existing approaches, there is a need for a mechanism that detects and possibly corrects operating system hang conditions in a more reliable and cost effective manner.
SUMMARY OF THE INVENTION
The deficiencies of the prior art are overcome by the method, system and apparatus of the present invention. For example, as heretofore unrecognized, it would be beneficial to detect an operating system hang condition by setting a status flag to a first value, generating an operating system interrupt intended for an operating system interrupt handler within an operating system kernel that resets the status flag to a second value, executing the operating system interrupt handler if the operating system kernel is responding to the operating system interrupt and executing a system BIOS interrupt handler that measures a time interval in which the status flag is set to the first value without being reset to the second value. If the measured time interval exceeds a predetermined threshold, an operating system hang condition is presumed to have occurred and an appropriate procedure may be called that, for example, informs the user of an operating system malfunction, automatically performs a system reset, corrects the problem causing the operating system hang condition or performs combinations thereof.
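The time-interval form of the check described above can be illustrated with a minimal sketch. This is an illustration under stated assumptions, not the patented implementation: the class name `HangMonitor`, the constants `FLAG_PENDING` and `FLAG_SERVICED`, the use of seconds as the time unit, and the `os_responding` switch that stands in for a live or hung kernel are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

FLAG_PENDING = 1   # hypothetical "first value": interrupt raised, not yet serviced
FLAG_SERVICED = 0  # hypothetical "second value": OS interrupt handler has run


@dataclass
class HangMonitor:
    threshold: float                      # assumed threshold, in seconds
    flag: int = FLAG_SERVICED
    pending_since: Optional[float] = None

    def raise_os_interrupt(self, now: float, os_responding: bool) -> None:
        """Set the flag to the first value and deliver the OS interrupt."""
        if self.flag != FLAG_PENDING:
            # Remember when the flag first became pending.
            self.flag = FLAG_PENDING
            self.pending_since = now
        if os_responding:
            self.os_interrupt_handler()

    def os_interrupt_handler(self) -> None:
        """Runs only if the kernel responds: reset the flag to the second value."""
        self.flag = FLAG_SERVICED
        self.pending_since = None

    def bios_interrupt_handler(self, now: float) -> bool:
        """Return True if the flag has stayed pending longer than the threshold."""
        if self.flag == FLAG_SERVICED:
            return False
        return (now - self.pending_since) > self.threshold


monitor = HangMonitor(threshold=3.0)
monitor.raise_os_interrupt(now=0.0, os_responding=True)
healthy = monitor.bios_interrupt_handler(now=1.0)       # kernel serviced the interrupt
monitor.raise_os_interrupt(now=2.0, os_responding=False)
too_early = monitor.bios_interrupt_handler(now=4.0)     # pending for 2 s, under threshold
hung = monitor.bios_interrupt_handler(now=6.0)          # pending for 4 s, over threshold
```

In this sketch the BIOS-level handler only reports the presumed hang; the response (informing the user, resetting the system, or attempting a correction) is left to the caller.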
In a first and preferred embodiment of the present invention, a timer is configured to set a status flag to a first value and generate an operating system interrupt in response to an overflow of the timer. The operating system interrupt is associated with a timer interrupt handler within an operating system kernel that functions to reset the status flag to a second value if the operating system kernel is responding to the operating system interrupt. A “watchdog” timer is also configured to periodically generate a system BIOS interrupt, where the system BIOS interrupt is associated with a watchdog timer handler. When the watchdog timer handler gains control of the processor, the watchdog timer handler increments a watchdog counter in response to the status flag having the first value and clears the watchdog counter in response to the status flag having the second value. If the watchdog counter exceeds a predetermined threshold, an operating system hang condition is presumed to have occurred and an appropriate procedure may be called that, for example, informs the user of an operating system malfunction, automatically performs a system reset, corrects the problem causing the operating system hang condition, or performs combinations thereof.
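The counter-based logic of this embodiment can be sketched as a small simulation. The sketch makes several assumptions that are not in the text: the names `WatchdogSim` and `ticks_until_hang_detected` are hypothetical, the two timers are modeled as firing once per simulated tick in lockstep (timer overflow first, watchdog second), and a hang is modeled simply as the kernel handler no longer running.

```python
FLAG_SET = 1    # hypothetical "first value", written on timer overflow
FLAG_CLEAR = 0  # hypothetical "second value", written by the kernel handler


class WatchdogSim:
    def __init__(self, threshold: int) -> None:
        self.threshold = threshold      # assumed count threshold
        self.status_flag = FLAG_CLEAR
        self.watchdog_counter = 0

    def timer_overflow(self, kernel_responding: bool) -> None:
        """Timer overflow: set the flag, then raise the OS interrupt."""
        self.status_flag = FLAG_SET
        if kernel_responding:
            self.timer_interrupt_handler()

    def timer_interrupt_handler(self) -> None:
        """Kernel-side handler: reset the flag to the second value."""
        self.status_flag = FLAG_CLEAR

    def watchdog_timer_handler(self) -> bool:
        """BIOS-side handler: count consecutive unserviced checks."""
        if self.status_flag == FLAG_SET:
            self.watchdog_counter += 1
        else:
            self.watchdog_counter = 0
        return self.watchdog_counter > self.threshold


def ticks_until_hang_detected(threshold, hang_at_tick, max_ticks=1000):
    """Return the tick at which a hang is presumed, or None if never."""
    sim = WatchdogSim(threshold)
    for tick in range(max_ticks):
        # The kernel stops servicing interrupts from hang_at_tick onward.
        sim.timer_overflow(kernel_responding=(tick < hang_at_tick))
        if sim.watchdog_timer_handler():
            return tick
    return None


detected_at = ticks_until_hang_detected(threshold=3, hang_at_tick=5)
never = ticks_until_hang_detected(threshold=3, hang_at_tick=2000)
```

With a threshold of 3 and a hang beginning at tick 5, the counter reads 1, 2, 3 and 4 on ticks 5 through 8, so the hang is presumed at tick 8. Because the counter is cleared whenever the flag is found reset, a single slow response does not accumulate toward the threshold.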
The technical advantages of the present invention include, but are not limited to, the following exemplary technical advantages. It should be understood that particular embodiments may not involve any, much less all, of the following exemplary technical advantages.
An important technical advantage of the present invention is that it better enables a user to detect an operating system hang condition by utilizing a more reliable detection mechanism.
Another important technical advantage of the present invention is that it provides a cost effective mechanism for detecting an operating system hang condition by eliminating the need for physical observation of system activity by a human operator.
Yet another important technical advantage of the present invention is the ability to reduce uncertainty and/or frustration of a user by ensuring that the user is informed of an operating system hang condition so that the user may take appropriate action.
Yet another important technical advantage of the present invention is the ability to reduce system downtime by providing a mechanism that can correct an operating system hang condition and/or automatically perform a system reset in response to detection of an operating system hang condition.
The above-described and other features of the present invention are explained in detail hereinafter with reference to the illustrative examples shown in the accompanying drawings. Those skilled in the art will appreciate that the described embodiments are provided for purposes of illustration and understanding and that numerous equivalent embodiments are contemplated herein.

