Reexamination Certificate
2001-03-08
2004-11-23
Le, Dieu-Minh (Department: 2113)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C714S020000, C714S048000
active
06823482
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates to managing partitioned systems. More particularly, the present invention relates to a system and method for reporting platform errors that are detected by the platform and reported to more than one partition within a computer system.
2. Description of the Related Art
Logical partitioning is the ability to make a single multiprocessing system run as if it were two or more independent systems. Each logical partition represents a division of resources in the system and operates as an independent logical system. Each partition is logical because the division of resources may be physical or virtual. An example of logical partitions is the partitioning of a multiprocessor computer system into multiple independent servers, each with its own processors, main storage, and I/O devices. One of multiple different operating systems, such as AIX, LINUX, or others, can be running in each partition.
In a Logically Partitioned (LPAR) multiprocessing system, there is a class of errors (Local) that are reported only to the operating system of the assigned or owning partition. Failures of I/O adapters that are assigned to a single partition's operating system are an example of this class. There is also another class of errors (Global) that are reported to each partition's operating system because they could potentially affect each partition's operation. Examples of this class are power supply, fan, memory, and processor failures.
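The Local/Global distinction can be illustrated with a minimal Python sketch. The names ErrorScope and partitions_to_notify, and the partition labels, are hypothetical and chosen only for illustration; the source does not describe an interface for routing errors.

```python
from enum import Enum
from typing import List, Optional


class ErrorScope(Enum):
    LOCAL = "local"    # e.g. failure of an I/O adapter owned by one partition
    GLOBAL = "global"  # e.g. power supply, fan, memory, or processor failure


def partitions_to_notify(scope: ErrorScope,
                         all_partitions: List[str],
                         owner: Optional[str] = None) -> List[str]:
    """Return the partitions whose operating systems should see the error.

    Hypothetical helper: Global errors go to every partition, Local errors
    go only to the owning partition.
    """
    if scope is ErrorScope.GLOBAL:
        return list(all_partitions)
    if owner is None:
        raise ValueError("a Local error needs an owning partition")
    return [owner]


partitions = ["aix_partition", "linux_partition"]
local_targets = partitions_to_notify(ErrorScope.LOCAL, partitions,
                                     owner="aix_partition")
global_targets = partitions_to_notify(ErrorScope.GLOBAL, partitions)
```

In this sketch a Global error reaches every partition, which is why the same failure can later be seen by more than one operating system.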
Logical partitioning is in common use today because it provides its users with flexibility to change the number of logical partitions in use and the amount of physical system resources assigned to each partition, in some cases while the entire system continues to operate. Logical partitioning is also used because certain applications or work environments may require a particular operating system.
For example, in a home-based business, a particular business application may be written for IBM's AIX® operating system, while another home application may be written for a Microsoft Windows operating system (such as Windows 98® or Windows 2000®). Rather than having separate computer systems for the various operating systems and applications, logical partitions allow the different applications and operating systems to be executed on the same physical machine. All of the operating systems can be loaded on one or more nonvolatile storage devices, such as hard disk drives (HDD), accessible by the computer system.
In some system environments, diagnostics are executed on the computer system periodically to determine whether the computer system requires maintenance. Services are provided to automatically receive reports from computer systems detailing the maintenance required. The diagnostic software is often included with the operating systems. Because each of the operating systems uses the same underlying hardware, the diagnostics for each operating system in a logically partitioned system are likely to detect and report the same error. In an automated service environment, having the same error reported multiple times may cause confusion and inefficiencies when servicing the systems. For example, if the AIX operating system detected that a firmware card within the computer was failing, it may send a report to one service organization to install a replacement card in the system. At the same time, another operating system loaded in the machine may report the same problem, causing either the same service organization or a different service organization to take action to replace the defective card.
What is needed, therefore, is a way of efficiently noting whether a hardware error has already been reported to one of the operating systems installed on a logically partitioned system.
SUMMARY
It has been discovered that a flag can be used to detect when a hardware error has already been reported, preventing duplicate servicing of the same hardware component. Computer system hardware and firmware cards have multiple components for providing a particular function, such as a video display and communications, to the user. One of these components is a firmware error buffer, which stores information identifying errors that have been detected in hardware. In addition to the error identifiers, the buffer includes an Already Reported Flag (ARF) that indicates whether each error has been reported to at least one operating system.
When an error is first reported, the ARF is set to “no” (i.e., “0”). After the first operating system requests error information and receives the error identifier, the ARF is set to “yes” (i.e., “1”), indicating that the corresponding error has been provided to one of the operating systems. Subsequently, when another operating system requests error information and retrieves the errors stored in the error buffer, the ARF will be used to indicate that the particular error has already been reported to one of the operating systems.
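The flag handling described above can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not the patented firmware: the class names ErrorEntry and FirmwareErrorBuffer, the method names, and the error identifiers are all hypothetical, and the sketch assumes the ARF is set on every entry handed out by a retrieval.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ErrorEntry:
    error_id: str                   # identifier of the detected hardware error
    already_reported: bool = False  # the ARF: False is "no" (0), True is "yes" (1)


@dataclass
class FirmwareErrorBuffer:
    entries: List[ErrorEntry] = field(default_factory=list)

    def record_error(self, error_id: str) -> None:
        """Store a newly detected error with its ARF cleared."""
        self.entries.append(ErrorEntry(error_id))

    def retrieve_errors(self) -> List[Tuple[str, bool]]:
        """Return (error_id, ARF) pairs as seen by the requesting operating
        system, then set the ARF on every entry that was handed out."""
        snapshot = [(e.error_id, e.already_reported) for e in self.entries]
        for e in self.entries:
            e.already_reported = True
        return snapshot


buffer = FirmwareErrorBuffer()
buffer.record_error("FAN_FAILURE")
first_view = buffer.retrieve_errors()    # first operating system to ask
buffer.record_error("POWER_SUPPLY_FAILURE")
second_view = buffer.retrieve_errors()   # a second operating system asks later
```

Here the first operating system sees the fan failure with the ARF cleared, while the second sees the same failure with the ARF set and the later power supply failure with the ARF still cleared.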
When the operating system retrieves the errors using diagnostics, it will create a report of detected errors in order to take corrective action to repair or maintain the computer system. For example, the errors with the ARF set to “no” can be highlighted to inform the user or service organization that these errors are newly reported. On the other hand, the report may note which errors have previously been reported so that a service organization or individual does not replace a component more than once.
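One possible way a diagnostic report could separate newly reported errors from previously reported ones is sketched below in Python. The function names, the report layout, and the sample error identifiers are assumptions made for illustration; the source does not specify a report format.

```python
from typing import Dict, Iterable, List, Tuple


def build_error_report(errors: Iterable[Tuple[str, bool]]) -> Dict[str, List[str]]:
    """Group (error_id, ARF) pairs into new and previously reported errors."""
    report: Dict[str, List[str]] = {"new": [], "previously_reported": []}
    for error_id, already_reported in errors:
        key = "previously_reported" if already_reported else "new"
        report[key].append(error_id)
    return report


def format_report(report: Dict[str, List[str]]) -> str:
    """Render the report as text, highlighting errors whose ARF was 'no'."""
    lines = [f"* NEW: {error_id}" for error_id in report["new"]]
    lines += [f"  previously reported: {error_id}"
              for error_id in report["previously_reported"]]
    return "\n".join(lines)


# Hypothetical data as one operating system's diagnostics might retrieve it.
retrieved = [("FAN_FAILURE", True), ("POWER_SUPPLY_FAILURE", False)]
report = build_error_report(retrieved)
text = format_report(report)
```

A service organization reading such a report could act on the highlighted entries and treat the remaining ones as already in hand.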
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.
Ahrens George Henry
Benignus Douglas Marvin
Tysor Arthur James
Le Dieu-Minh
Leeuwen Joseph T. Van
Roberts Diana L.
Wilson Yolanda L