Reexamination Certificate
1999-06-29
2003-04-15
Beausoliel, Robert (Department: 2184)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C714S014000, C714S022000
active
06550017
ABSTRACT:
BACKGROUND OF THE INVENTION
This invention relates to a system and method for monitoring a distributed fault tolerant computer system. In particular, the invention is directed to monitoring and reporting the operation status of nodes of a distributed fault tolerant computer system. The invention can find application to the automatic configuration of a distributed fault tolerant computer system.
One application for a distributed fault tolerant system is in the telecommunications industry. The telecommunications industry is going through some fundamental changes that have caused a significant shift in the requirements placed on its information infrastructure. Deregulation of the services provided by the carriers, the introduction of new wireless services, and the addition of information processing (IP) services have created new challenges and opportunities in this rapidly growing industry. Competition in the industry has significantly reduced the time available to service providers to test and develop their own systems.
Traditionally, telecommunication companies have relied on hardware fault tolerant systems and extensive testing of their applications to discover system and application software faults. However, competition and the need to bring new services to the market quickly mean that such an approach is no longer possible in all cases if the service providers are to provide new services while maintaining the level of service and reliability that their customers are accustomed to.
Distributed Fault Tolerant (DFT) systems provide the basis for one approach specifically to address the requirements of a changing telecommunication industry. A DFT system has the potential to tolerate not only the failures of the hardware components of the system, but also the failures of its software elements. A traditional lock-step hardware fault tolerant system is perfectly capable of masking hardware component failures from its users but it is unable to accomplish the same for a software failure. The difficulty arises from the fact that the redundant hardware components of such a system execute the same set of instructions at the same time on effectively the same system and are, therefore, subject to the same set of software failures.
While it is possible to discover and correct “functional” bugs in the software by a rigorous qualification cycle, it is far more difficult to detect and correct the failures associated with the execution environment of a program. Such “Heisenbugs”, as they are called, are rarely discovered and corrected during the normal testing and qualification cycle of the system and occur only under circumstances that are very difficult to reproduce. The observation that the execution of the same program on the same (or identically configured) system, but at a different time, does not result in the same “Heisenbug” is the key to making it possible to tolerate such failures via redundancy, fault isolation, and fault containment techniques. DFT is based on this observation and uses redundant hardware and software components to achieve both hardware and software fault tolerance by isolating and containing the domain of such failures to a single member of the distributed system. Accordingly, it is desirable that a DFT system should be able to identify at least software failures that lead to the inoperability of a node of the system.
Moreover, in the telecommunications industry, stringent timing and availability requirements are set. Most applications in this market differ from those in other commercial sectors by the requirement for “real-time” behavior. This requires the computing infrastructure to incorporate the notion of “real-time” into its design and effectively guarantee that certain actions occur within a specified period. While it may be acceptable for a “mission-critical” enterprise system to have a large degree of variance in the time that it takes to respond to the same service request at different times, such non-deterministic behavior cannot be tolerated by a telecommunications computer system. In order to meet these stringent timing requirements, the industry has resorted to proprietary hardware and software components, resulting in a complicated application development environment, increased time to market, and reluctance to adopt new and efficient programming techniques. It would be desirable to enable a DFT system to address the unique requirements of the telecommunications industry without introducing an unnecessarily complicated programming model. Thus, it would be desirable to use, wherever possible, standard Off-The-Shelf (OTS) hardware and software components that allow for application development in a modern environment. It would therefore be desirable to minimize the amount of special purpose hardware and software needed.
One of the most important requirements of a telecommunication computer system is its availability. This is typically measured as the percentage of time that the system is available. However, it can also be stated in terms of the time that the system is unavailable. From this figure it is possible to calculate the maximum length of service disruption due to a failure. However, such a derivation assumes that the maximum number of failures over a period of time is known and that failures (or unplanned outages) are the only cause of service unavailability. Instead, a second requirement is commonly used that specifies the maximum length of service unavailability due to a failure. Another requirement of a telecommunication computing system stems from its unique maintenance and service model. While it is perfectly reasonable to assume that an enterprise system will be serviced and maintained locally by a system administrator conversant in the current technology, such an assumption is not valid for a telecommunication system, which is typically located in a Central Office (CO) miles away from the nearest suitable system administrator. This lack of trained service and maintenance personnel translates the implicit competence of such personnel into explicit system requirements. Accordingly, it would be desirable to provide a structure that provides the basis for achieving at least a degree of automation of fault reporting and system reconfiguration.
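The availability arithmetic described above can be illustrated with a minimal sketch. The function names and the 99.999% figure are assumptions chosen for illustration and do not come from the patent text. The per-failure figure follows one reading of the derivation: the downtime budget is divided evenly over a known maximum number of failures, with failures taken as the only cause of unavailability.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Restate an availability percentage as unavailable minutes per year."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


def outage_budget_per_failure(availability_percent: float,
                              max_failures_per_year: int) -> float:
    """Downtime budget per failure, in minutes.

    Assumes (as the text notes such a derivation must) that the maximum
    number of failures is known and that failures are the only cause of
    unavailability. The even split across failures is an illustrative
    simplification.
    """
    if max_failures_per_year <= 0:
        raise ValueError("max_failures_per_year must be positive")
    return downtime_minutes_per_year(availability_percent) / max_failures_per_year


if __name__ == "__main__":
    # Hypothetical target: 99.999% availability, at most 2 failures per year.
    print(round(downtime_minutes_per_year(99.999), 3))
    print(round(outage_budget_per_failure(99.999, 2), 3))
```

Because the number of failures is rarely known in advance, this derived budget is fragile, which is consistent with the text's remark that a separate requirement on the maximum outage per failure is commonly used instead.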
The invention seeks to provide a monitor system that provides the potential to address at least some of the problems and desires mentioned above.
SUMMARY OF THE INVENTION
Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Combinations of features from the dependent claims may be combined with features of the independent claims as appropriate and not merely as explicitly set out in the claims.
In accordance with one aspect of the invention, there is provided a monitor system for a distributed fault tolerant computer system. The monitor system includes a counter mechanism operable to count from a reset value towards a fault value and to output a fault signal if the fault value is reached. A counter reset routine is implemented in software and is operable repeatedly to reset the counter mechanism to its reset value during normal operation of the counter reset routine, thus preventing the counter mechanism from reaching the fault value during normal software operation. A unit connectable to a bus to supply a status signal indicative of the status of the unit is arranged to be responsive to a fault signal being output from the counter mechanism to provide an OFF status indication to the bus.
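The interplay of the counter mechanism, the software reset routine, and the unit that reports to the bus can be sketched as a small simulation. This is only an illustration under stated assumptions, not the claimed implementation: the class names, the upward counting direction, the "ON" status string, the tick-based timing, and the numeric values are all hypothetical. The source specifies only that the counter counts from a reset value towards a fault value and that the unit provides an OFF status indication when the fault signal is output.

```python
class Bus:
    """Hypothetical stand-in for the bus that receives unit status."""

    def __init__(self):
        self.status = {}

    def report(self, unit_id, status):
        self.status[unit_id] = status


class CounterMechanism:
    """Counts from a reset value towards a fault value (upward here, by assumption)."""

    def __init__(self, reset_value, fault_value):
        if fault_value <= reset_value:
            raise ValueError("fault_value must be greater than reset_value")
        self.reset_value = reset_value
        self.fault_value = fault_value
        self.value = reset_value

    def reset(self):
        self.value = self.reset_value

    def tick(self):
        """Advance one step; return True if the fault signal is output."""
        if self.value < self.fault_value:
            self.value += 1
        return self.value >= self.fault_value


class Unit:
    """Unit attached to the bus; reports OFF when it sees the fault signal."""

    def __init__(self, unit_id, bus):
        self.unit_id = unit_id
        self.bus = bus
        bus.report(unit_id, "ON")  # "ON" is an assumed normal-status label

    def on_fault_signal(self):
        self.bus.report(self.unit_id, "OFF")


def simulate(ticks, software_alive_until, reset_every=2, fault_value=5):
    """Run the node for `ticks` steps; the software hangs at `software_alive_until`."""
    bus = Bus()
    counter = CounterMechanism(reset_value=0, fault_value=fault_value)
    unit = Unit("node-0", bus)
    for t in range(ticks):
        # The counter reset routine only runs while the software is healthy.
        if t < software_alive_until and t % reset_every == 0:
            counter.reset()
        if counter.tick():
            unit.on_fault_signal()
    return bus.status["node-0"]


if __name__ == "__main__":
    print(simulate(ticks=20, software_alive_until=20))  # software never hangs
    print(simulate(ticks=20, software_alive_until=6))   # software hangs at tick 6
```

In the first run the reset routine keeps the counter well below the fault value, so the status stays at its normal value. In the second, the resets stop once the simulated software hangs, the counter reaches the fault value, and the unit reports OFF to the bus.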
In this manner, a monitor system is able to detect a fault in the software running on the node (for example if the operating system hangs) and to report this to the bus. This can be achieved through the minimum of special purpose hardware. Moreover, as will be described with respect to preferred embodiments of the invention, the monitor system provides the potential to achieve a degree of automation with respect to the reporting of faults and the configuration of the distributed fault
Dickinson Peter Martin Grant
Moiin Hossein
Beausoliel Robert
Bonzo Bryce P.
Kivlin B. Noël
Meyertons Hood Kivlin Kowert & Goetzel P.C.
Sun Microsystems Inc.