Identifying field replaceable units responsible for faults...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C713S002000

active

06725396

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates in general to data processing systems and, in particular, to a method for identifying faults associated with field replaceable units (FRUs) of a data processing system. Still more particularly, the present invention relates to a method for correctly identifying whether a fault identified by a processor timeout is isolated to the correct FRU utilizing an Initial Program Load (IPL) boot progress status indicator.
2. Description of the Related Art
A typical data processing system consists of a central processing unit (CPU), memory components, and a number of device controllers that are typically connected through a system of buses that provides access to shared memory. In order to operate, the data processing system requires electrical power and software components that control the interactions between the various hardware components during operation.
For a data processing system to start running, for instance upon power-up or reboot, an initializing program is necessary. The initializing program, or bootstrap program, preferably initializes (i.e., activates) all components (hardware, firmware and software) of the data processing system, from CPU registers to device controllers and memory contents. At startup, each of the various hardware components of the data processing system first performs an internal reset procedure to obtain a known stable state. Once these hardware reset procedures have completed successfully, each component of the data processing system performs a Logical Built-in Self-Test (LBIST) or an Array Built-in Self-Test (ABIST). A service processor then performs an LBIST or ABIST signature verification against a known signature value. Once the verification is complete, the service processor initializes each component of the data processing system.
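The signature verification step described above can be pictured with a minimal sketch. The `Component` type, its field names, and the signature values are assumptions made for illustration only; the source does not specify how signatures are stored or compared.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical record for one hardware component (illustrative only)."""
    name: str
    bist_signature: int      # signature produced by the component's LBIST/ABIST run
    expected_signature: int  # known-good signature value (assumed to be on file)


def verify_signatures(components):
    """Return the names of components whose self-test signature does not
    match the known signature value."""
    return [c.name for c in components
            if c.bist_signature != c.expected_signature]


# Example data (made up): one component fails verification.
components = [
    Component("cpu0", 0x1A2B, 0x1A2B),
    Component("memory controller", 0x0F0F, 0xFFFF),
]
failed = verify_signatures(components)
```

In this sketch, initialization would proceed only for components that do not appear in `failed`.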
Next, firmware is executed to complete the initialization process. In many data processing systems, this firmware includes Power-On-Self-Test (POST) software that surveys and performs sanity checks on the system hardware, a Basic Input Output System (BIOS) that interfaces processor(s) to key peripherals such as a keyboard and display monitor, and an operating system loader (bootstrap) program that launches execution of a selected operating system. These basic firmware procedures, which are often bundled together in a startup flash memory, enable the data processing system to obtain an operating state at which the data processing system is available to execute software applications.
During execution, the service processor and firmware typically interact with one specific component within the data processing system at a time. When a system “hang” occurs during startup, there is a high probability that the cause of the system “hang” is related to the component that the firmware or the service processor is accessing at that time. Without any additional knowledge, however, the identification of the source of error is typically accomplished by replacing each adapter card in the data processing system to determine whether or not the adapter card caused the system “hang.”
State-of-the-art data processing systems utilizing specialized processor chips generally include a hang detection mechanism for the firmware-encountered hangs described above. For example, the Power PC 630 processors (i.e., data processing systems with 630 processor chips) have a built-in hang detection mechanism, which is triggered when the 630 processor chip stops executing instructions. In some instances, however, false FRU faults are indicated when the condition that actually causes the timeout/fault occurs at boot time and the processor card is not the cause of the error. In these cases, the 630 watchdog times out because the input/output (I/O) subsystem is not able to provide the boot instructions for any of a number of reasons, causing the 630 processor to operate in a loop waiting for instructions to execute. Presently, there is no way for the hang detection mechanism to isolate which FRU(s) are responsible for system hangs during processor operation.
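The limitation described above can be illustrated with a minimal sketch of a watchdog that fires when the processor stops completing instructions. The class name, the tick-based sampling, and the instruction counter are assumptions for illustration; the source does not describe the internals of the 630 hang detection mechanism.

```python
class HangDetector:
    """Hypothetical watchdog: reports a timeout when the count of completed
    instructions has not changed for `timeout_ticks` consecutive samples."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.last_count = None
        self.idle_ticks = 0

    def tick(self, instructions_completed):
        """Sample the instruction counter; return True on timeout."""
        if instructions_completed != self.last_count:
            # Progress was made since the last sample, so reset the idle timer.
            self.last_count = instructions_completed
            self.idle_ticks = 0
            return False
        self.idle_ticks += 1
        return self.idle_ticks >= self.timeout_ticks


# The counter stalls at 20, e.g. while the processor waits for boot
# instructions that the I/O subsystem never delivers.
detector = HangDetector(timeout_ticks=3)
results = [detector.tick(count) for count in (10, 20, 20, 20, 20)]
```

The detector sees only that the counter stopped advancing. Nothing in its input says whether the processor card or the I/O subsystem is at fault, which is the gap the fault isolation logic is meant to close.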
The present invention recognizes that it would therefore be desirable to provide a method, system, and program product that isolates faults identified during boot-up and/or operation of a data processing system to a correct field replaceable unit (FRU). The invention further recognizes that it would be time-saving if the method, system, and program product utilized the boot progress indicators of the Initial Program Loader (IPL) to complete the fault isolation procedure.
SUMMARY OF THE INVENTION
Described is a method, system, and program product for isolating faults to a correct field replaceable unit (FRU) of a data processing system utilizing the boot progress indicators of the initial program loader (IPL). Fault isolation logic is associated with the hang detection mechanism of the data processing system's processor. The hang detection mechanism monitors the processor for a timeout, i.e., a condition in which the processor “hangs.” When a timeout occurs, the fault isolation logic is triggered and checks the boot record to determine whether the timeout occurred because of an FRU fault before or after the service processor completed system initialization. The result of the check is output to a user/administrator. When the timeout condition occurred because of an error while the service processor was loading operating system (OS) (e.g., AIX) instructions from the boot device in the input/output (I/O) subsystem, the FRU fault is indicated to be a boot fault associated with the I/O planar and the processor card. When the FRU fault occurred prior to fetching the OS instructions from the boot device and transferring control to system firmware, or after the service processor completed its system initialization procedures (i.e., when the system firmware began initializing the hardware and the processor began operating), the fault is attributed to the processor card and backplane. Attribution of boot error faults to incorrect FRUs is therefore substantially eliminated.
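The decision described in this summary can be sketched as a lookup from the recorded boot progress to the suspect FRUs. This is a minimal sketch under stated assumptions: the `BootPhase` names and the `isolate_fru` function are hypothetical, and the source does not define the format of the boot record or of the IPL progress indicators.

```python
from enum import Enum


class BootPhase(Enum):
    """Hypothetical boot progress values read from the boot record."""
    BEFORE_OS_FETCH = 1   # before OS instructions are fetched from the boot device
    LOADING_OS = 2        # OS instructions being loaded from the boot device
    FIRMWARE_RUNNING = 3  # system firmware initializing hardware, processor operating


def isolate_fru(phase):
    """Map the boot phase at the time of the timeout to the suspect FRUs,
    following the two cases given in the summary."""
    if phase is BootPhase.LOADING_OS:
        # Boot fault: the I/O subsystem may have failed to supply instructions.
        return ("I/O planar", "processor card")
    # Timeout before the OS fetch, or after service processor initialization.
    return ("processor card", "backplane")


suspects = isolate_fru(BootPhase.LOADING_OS)
```

A hang detection handler would call a function like `isolate_fru` with the phase found in the boot record and report the returned FRUs to the user/administrator.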
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.

