Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1999-05-06
2002-09-17
Iqbal, Nadeem (Department: 2184)
C712S227000
active
06453430
ABSTRACT:
BACKGROUND OF THE INVENTION
Modern computer controlled devices rely heavily on the proper functioning of software processing to control their general operation. Typically in such devices, an operating system is made up of one or more software programs that execute on a central processing unit (CPU) in the device and schedule the operation of processing tasks. During execution, the operating system provides routine processing functions which may include device resource scheduling, process control, memory and input/output management, system services, and error or fault recovery. Generally, the operating system organizes and controls the resources of the device to allow other programs or processes to manipulate those resources to provide the functionality associated with the device.
Modern central processing units (i.e., microprocessors) can execute sequences of program instructions very quickly. Operating systems which execute on such processors take advantage of this speed by scheduling multiple programs to execute “together” as individual processes. In these systems, the operating system divides the total available processor cycle time among the executing processes in a timesliced manner. By allowing each process to execute some instructions during that process-designated timeslice, and by rapidly switching between timeslices, the processes appear to be executing simultaneously. Operating systems providing this capability are called multi-tasking operating systems.
Fault management and control in devices that execute multiple processes is an important part of device operation. As an example of fault control, suppose that a first process depends upon a second process for correct operation. If the second process experiences an error condition such as failing outright, hanging or crashing, the first dependent process may be detrimentally affected (e.g., may operate improperly).
Fault control in some form or another is usually provided by the operating system, since the operating system is typically responsible for dispatching and scheduling most processes within a computer controlled device. Many prior art operating systems and device control programs include some sort of process-monitoring process such as a dispatcher process, a monitor daemon, a watchdog process, or the like. One responsibility of such monitoring processes is to restart failed, hung or crashed processes.
As an example, prior art U.S. Pat. No. 4,635,258, issued Jan. 6, 1987, discloses a system for detecting a program execution fault. This patent is hereby incorporated by reference in its entirety. The fault detection system disclosed in this patent includes a monitoring device for monitoring the execution of program portions of a programmed processor for periodic trigger signals. Lack of trigger signal detection indicates a fault condition and the monitoring device generates a fault signal in response to a detected faulty program execution condition. Logic circuitry is included for restarting a process that the monitoring device indicates has faulted. The monitoring device may require a predetermined number of trigger signals before indicating an alarm condition. The system also includes circuitry for limiting the number of automatic restarts to a predetermined number which avoids continuous cycling between fault signal generation and reset.
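The prior art behavior described above can be sketched in software. The following is a minimal illustration only: the cited patent describes monitoring circuitry, whereas this sketch is a software analogy, and the names `Watchdog`, `trigger`, `check`, `timeout` and `max_restarts` are hypothetical and do not appear in either patent.

```python
import time
from typing import Callable


class Watchdog:
    """Software analogy of a trigger-signal monitor with a restart limit.

    Assumption: a fault is declared when no trigger signal has been seen
    for longer than `timeout` seconds.
    """

    def __init__(self, timeout: float, max_restarts: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self.max_restarts = max_restarts
        self.clock = clock
        self.last_trigger = clock()
        self.restarts = 0

    def trigger(self) -> None:
        """Called periodically by the monitored program while it is healthy."""
        self.last_trigger = self.clock()

    def check(self, restart: Callable[[], None]) -> str:
        """Return 'ok', 'restarted', or 'halted' (restart limit reached)."""
        if self.clock() - self.last_trigger <= self.timeout:
            return "ok"
        if self.restarts >= self.max_restarts:
            # Limiting restarts avoids endless cycling between fault and reset.
            return "halted"
        self.restarts += 1
        restart()
        # Give the restarted program a fresh timeout window.
        self.last_trigger = self.clock()
        return "restarted"
```

Note that this sketch restarts immediately on every detected fault, with no regard to the cause of the fault or to the load that restarting places on the system.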
SUMMARY OF THE INVENTION
Prior art fault management systems can experience problems in that these systems restart a failed process with little or no regard to why the process faulted in the first place. Prior art systems using this approach can heavily burden the processor and other system resources, since restarting a process can require significant overhead. In extreme cases, high system overhead may have caused the fault in the first place, and the prior art restart mechanisms which further load the system only serve to compound the problem.
Furthermore, by not determining the cause of the fault, prior art systems that simply restart faulted processes may end up creating lengthy, and possibly “infinite” process restarting loops. Such loops may over-utilize system resources such as processor bandwidth and memory in attempts to endlessly rejuvenate a failed process that may be failing due to an external event beyond the control of the process.
Those prior art systems that attempt restarts with only a limited number of allowed restarts may avoid the problem of endless process restarting loops, but still suffer from system over-utilization during the restart period.
In contrast, the present invention provides a unique approach to fault management. In this invention, fault conditions related to processes can be handled passively, actively, or both passively and actively through the use of unique process restart sequences. Passively handling faults, called passive fault management, comprises detecting faults and waiting for a period of time for the conditions that led to the fault to change and the fault to correct itself. On the other hand, active fault management attempts to determine the cause of the fault and to remedy the situation, thereby preventing future faults.
In this invention, the process restart sequences allow restarting of a failed process according to a sequence or schedule that manages the loads placed on system resources during failure and restarting conditions while maintaining the utmost availability of the process. In real-time or mission critical environments, such as in data communications networking devices or applications, the invention provides significant advancements in fault management.
More specifically, embodiments of the present invention relate to systems, methods and apparatus for handling processing faults in a computer system. According to a general embodiment of the invention, a system provides a method of detecting a fault condition which causes improper execution of a set of instructions. The system then determines a period of time to wait in response to detecting the fault condition and waits the period of time in an attempt to allow the fault condition to be minimized. This is an example of passive fault management. The system then initiates execution of the set of instructions after waiting the period of time. The system then repeats the operations of detecting, determining, waiting and initiating. Preferably, each repeated operation of determining a period of time determines successively longer periods of time to wait. Accordingly, this embodiment of the invention provides a passive process restart back-off mechanism that allows restarting of processes in a more controlled and time-spaced manner which conserves system resources and reduces peak processing loads while at the same time attempting to maintain process availability.
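The detect, determine, wait and initiate cycle described above can be illustrated with a short sketch. This is a simplified illustration under stated assumptions, not the claimed implementation: a fault is modeled as a raised exception, the successively longer waits are produced by a doubling schedule (the text requires only that the periods grow, not that they double), and the names `backoff_delays` and `run_with_backoff` are hypothetical.

```python
import time
from typing import Callable, Iterable, List


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   max_restarts: int = 5) -> List[float]:
    """Successively longer wait periods (assumed here to grow geometrically)."""
    return [base * factor ** i for i in range(max_restarts)]


def run_with_backoff(task: Callable[[], None], delays: Iterable[float],
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `task`; on each fault, wait the next period, then restart it.

    Returns the number of restarts performed before the task completed.
    Re-raises the last fault once the delay schedule is exhausted.
    """
    restarts = 0
    pending = iter(delays)
    while True:
        try:
            task()                        # initiate execution
            return restarts
        except Exception:                 # detect the fault condition
            delay = next(pending, None)   # determine the period to wait
            if delay is None:
                raise                     # schedule exhausted, give up
            sleep(delay)                  # wait for conditions to improve
            restarts += 1
```

Passing `sleep` as a parameter is a design choice made for this sketch so that the waiting behavior can be observed or replaced in tests without actually pausing.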
Preferably, the system is implemented on a computer controlled device, such as a data communications device. The device includes a processor, an input mechanism, an output mechanism, a memory/storage mechanism and an interconnection mechanism coupling the processor, the input mechanism, the output mechanism, and the memory/storage mechanism. The memory/storage mechanism maintains a process restarter. The process restarter is preferably a process or program that executes as part of, or in conjunction with, the operating system of the device. The invention is preferably implemented in an operating system such as the Cisco Internetworking Operating System (IOS), manufactured by Cisco Systems, Inc., of San Jose, Calif.
The process restarter executes in conjunction with the processor and detects improper execution of a set of instructions on the processor and re-initiates execution of the same set of instructions in response to detecting improper execution. The process restarter also repeatedly performs the detecting and initiating operations according to a first restart sequence, and repeatedly performs the detecting and initiating operations according to a second restart sequence. The second restart sequence causes the process restarter to initiate execution of the set of instructions in a different sequence than the firs
Singh Daljeet
Waclawsky John G.
Chapin Barry W.
Chapin & Huang, L.L.C.
Cisco Technology Inc.
Iqbal Nadeem