Method for performing hierarchical hang detection in a...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C714S048000

active

06587963

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to a method of detecting which of a plurality of hardware devices in a computer system are failing, resulting in hanging of the computer system.
2. Description of Related Art
The basic structure of a conventional multi-processor computer system 10 is shown in FIG. 1. Computer system 10 has several processing units, two of which 12a and 12b are depicted, which are connected to various peripheral devices, including input/output (I/O) devices 14 (such as a display monitor, keyboard, and permanent storage device), memory device 16 (such as random access memory or RAM) that is used by the processing units to carry out program instructions, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals (usually the permanent memory device) whenever the computer is first turned on. Processing units 12a and 12b communicate with the peripheral devices by various means, including a generalized interconnect or bus 20. Computer system 10 may have many additional components which are not shown, such as serial and parallel ports for connection to, e.g., modems or printers. Those skilled in the art will further appreciate that there are other components that might be used in conjunction with those shown in the block diagram of FIG. 1; for example, a display adapter might be used to control a video display monitor, a memory controller can be used to access memory 16, etc. The computer can also have more than two processing units.
A processing unit includes a processor core 22 having a plurality of registers and execution units, which carry out program instructions in order to operate the computer. An exemplary processing unit includes the PowerPC™ processor marketed by International Business Machines Corp. The processing unit can also have one or more caches, such as an instruction cache 24 and a data cache 26, which are implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from memory 16. These caches are referred to as “on-board” when they are integrally packaged with the processor core on a single integrated chip 28. Each cache is associated with a cache controller (not shown) that manages the transfer of data between the processor core and the cache memory.
A processing unit can include additional caches, such as cache 32, which is referred to as a level 2 (L2) cache since it supports the on-board (level 1) caches 24 and 26. In other words, cache 32 acts as an intermediary between memory 16 and the on-board caches, and can store a much larger amount of information (instructions and data) than the on-board caches can, but at a longer access penalty. For example, cache 32 may be a chip having a storage capacity of 512 kilobytes, while the processor may be an IBM PowerPC™ 604-series processor having on-board caches with 64 kilobytes of total storage. Cache 32 is connected to bus 20, and all loading of information from memory 16 into processor core 22 must come through cache 32. Although FIG. 1 depicts only a two-level cache hierarchy, multi-level cache hierarchies can be provided where there are many levels (L3, L4, etc.) of serially connected caches.
As computer systems have become more complex, it has contemporaneously become more difficult to determine the cause of computer malfunctions, in spite of extensive factory testing. Some malfunctions are more serious than others. For example, if an error occurs when a value is read from or written to the system memory device, a parity checking technique with built-in error control is often able to automatically correct the error, and the computer may continue operation with practically no noticeable interruption. More serious errors may generate interrupt signals which can temporarily delay computer processing. These interrupts can require various components to be reset, or may call interrupt handlers, monitoring routines or debugging software in order to deal with, and possibly determine the cause of, the problem.
In the most serious cases, a hardware failure can cause a computer component to halt operation, a fault condition referred to as a “hang.” When the component hangs, the entire computer system must usually be reset, that is, the power turned off and then back on again. This situation is not only inconvenient to users, but can further result in grievous loss of data, or crucial loss of control for an operation-critical system. These failures may arise either due to a soft error (a random, transient condition caused by, e.g., stray radiation or electrostatic discharge), or due to a hard error (a permanent condition, e.g., a defective transistor or interconnect line). One common cause of errors is a soft error resulting from alpha radiation emitted by the lead in the solder (C4) bumps used to form wire bonds with circuit leads.
It is accordingly important to be able to determine the true cause of a system failure (or as close as possible to the true cause) in order to address the problem and carry out appropriate repairs or replacement, as well as implement new engineering solutions for later manufacturing. However, in modern day systems having greater depth, when a computer access must go through several layers of devices to be serviced, it is often difficult or impossible to determine which component has caused the primary problem.
Consider for example a simple read operation. Referring to FIG. 1, a processor core such as 22 loads an instruction to retrieve (read) a particular data value (operand data) for further processing. In a problem-free system, when the processor executes the read operation, it passes the request down to data cache 26. If data cache 26 does not hold a valid copy of the requested value, then the request is passed to the L2 cache 32. If the value is also not present at L2 cache 32, then the request is passed down in a similar manner to lower levels of the memory hierarchy (if additional cache levels are present), until it is received by system memory 16. The value may not be in system memory, if it has temporarily been placed on a permanent storage device (hard disk drive, or HDD), e.g., in a “virtual memory” configuration. In such a case, the value must further be retrieved from the I/O device 14. Once the value is located, it is passed back up the memory hierarchy and loaded into processor core 22.
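The read path just described can be sketched in software as a walk down a list of levels. This is only an illustrative model, not anything taken from the patent: the `Level` class, the `read` function, the addresses, and the stored values are all hypothetical, and real hardware performs these steps with cache controllers rather than dictionary lookups.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Level:
    """One layer of the memory hierarchy (hypothetical software model)."""
    name: str
    contents: Dict[int, int] = field(default_factory=dict)


def read(hierarchy: List[Level], address: int) -> Tuple[Optional[int], List[str]]:
    """Pass a read request down the hierarchy until some level holds the value.

    Returns the value (None if no level holds it) and the names of the
    levels the request visited, in order.
    """
    visited: List[str] = []
    for depth, level in enumerate(hierarchy):
        visited.append(level.name)
        if address in level.contents:
            value = level.contents[address]
            # Pass the value back up: every level above the hit keeps a copy.
            for upper in hierarchy[:depth]:
                upper.contents[address] = value
            return value, visited
    return None, visited


# Levels named after the reference numerals of FIG. 1; contents are made up.
hierarchy = [
    Level("data cache 26"),
    Level("L2 cache 32"),
    Level("system memory 16", {0x100: 42}),
    Level("I/O device 14", {0x100: 42, 0x200: 7}),
]

# Address 0x200 is only on the I/O device, so the request visits all levels.
value, path = read(hierarchy, 0x200)
print(value, path)
```

A second read of the same address would be satisfied by the data cache alone, since the first read filled the upper levels on the way back up.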
If any level in this access chain fails, then the entire system may hang. Under these circumstances, it is often unclear which component has actually caused the problem. It is sometimes necessary to have field diagnostics performed to determine the cause, which can be very expensive. Alternatively, several components might have to be replaced if the single failing component cannot be specifically identified. It would, therefore, be desirable to provide an improved method of indicating which component has caused a computer system to halt operation. It would be further advantageous if the method could allow a more accurate diagnostic call, or simplify debugging of the hang.
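The diagnostic difficulty can be made concrete with a small sketch. It only models the problem stated above, not the detection method of the invention; the function `processor_observation` and the level names are hypothetical. A failed level is assumed never to respond, so the processor sees the same symptom whichever level failed.

```python
from typing import List, Optional


def processor_observation(levels: List[str], failed: Optional[str], holder: str) -> str:
    """Return what the processor observes for one read request.

    The request visits levels in order. A failed level never responds
    (assumption), so the request hangs there; otherwise the level named
    `holder` returns the data.
    """
    for name in levels:
        if name == failed:
            return "hang"  # request never completes
        if name == holder:
            return "data returned"
    return "not found"


levels = ["data cache 26", "L2 cache 32", "system memory 16", "I/O device 14"]

# The value lives on the I/O device, so the request crosses every level.
observations = {
    failed: processor_observation(levels, failed, holder="I/O device 14")
    for failed in levels
}

# Every failure location yields the same symptom at the processor.
for failed, seen in observations.items():
    print(f"failed level: {failed:18s} -> processor sees: {seen}")
```

Because all four cases are indistinguishable from the top of the chain, identifying the failing component requires information from within the hierarchy itself.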
SUMMARY OF THE INVENTION
It is therefore one object of the present invention to provide an improved computer system.
It is another object of the present invention to provide an improved method of diagnosing operational problems in a computer system.
It is yet another object of the present invention to provide such a method which detects a primary component causing the computer system to hang.
The foregoing objects are achieved in a method of detecting a hang in a computer system, wherein the computer system includes a processing unit and a memory subsystem providing one or more access layers, generally comprising the steps of generating a plurality of hang strobe signals (including at least a first hang
