Recovery from data fetch errors in hypervisor code


Reexamination Certificate


Classification: C714S011000, C709S241000, C712S013000, C712S228000

Status: active

Patent number: 06658591

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates generally to the field of computer architecture and, more specifically, to methods and systems for managing resources among multiple operating system images within a logically partitioned data processing system.
2. Description of Related Art
A logical partitioning option (LPAR) within a data processing system (platform) allows multiple copies of a single operating system (OS), or multiple heterogeneous operating systems, to run simultaneously on a single data processing system platform. A partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's resources. These allocable resources include one or more architecturally distinct processors with their interrupt management areas, regions of system memory, and I/O adapter bus slots. Each partition's resources are presented to its OS image through the partition's own Open Firmware device tree.
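The non-overlapping allocation rule can be illustrated with a minimal sketch. The `Partition` and `Platform` classes, the resource identifiers, and the half-open `(start, end)` memory ranges are all hypothetical; they model the idea of disjoint processor, memory, and I/O slot assignments, not any real firmware interface.

```python
from dataclasses import dataclass, field

# Hypothetical model of one partition's allocable resources (illustrative only).
@dataclass
class Partition:
    name: str
    processors: set[int] = field(default_factory=set)
    memory_regions: set[tuple[int, int]] = field(default_factory=set)  # half-open (start, end)
    io_slots: set[str] = field(default_factory=set)

class Platform:
    def __init__(self) -> None:
        self.partitions: list[Partition] = []

    def add_partition(self, new: Partition) -> None:
        """Admit a partition only if none of its resources overlap an existing one."""
        for p in self.partitions:
            if p.processors & new.processors:
                raise ValueError(f"processor already assigned to {p.name}")
            if p.io_slots & new.io_slots:
                raise ValueError(f"I/O slot already assigned to {p.name}")
            for a_start, a_end in p.memory_regions:
                for b_start, b_end in new.memory_regions:
                    if a_start < b_end and b_start < a_end:  # the two ranges intersect
                        raise ValueError(f"memory region overlaps {p.name}")
        self.partitions.append(new)

# Example: the second partition asks for processor 1, which lpar1 already owns.
platform = Platform()
platform.add_partition(Partition("lpar1", {0, 1}, {(0x0000, 0x4000)}, {"slot-A"}))
try:
    platform.add_partition(Partition("lpar2", {1, 2}, {(0x4000, 0x8000)}, {"slot-B"}))
except ValueError as err:
    print(err)  # processor already assigned to lpar1
```

Checking at admission time keeps the invariant simple: once a partition is accepted, no other partition can hold any of its resources.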
Each distinct OS, or image of an OS, running within the platform is protected from the others, so that software errors in one logical partition cannot affect the correct operation of any other partition. This protection is provided by allocating to each OS image a disjoint set of platform resources that it manages directly, and by providing mechanisms that ensure an image cannot control any resource that has not been allocated to it. Furthermore, software errors in the control of an OS's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocable resources within the platform.
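The enforcement mechanism described above can be sketched as an ownership check consulted before a partition touches a resource. `ResourceGuard`, `allocate`, and `check_access` are hypothetical names for illustration; a real platform enforces this in hardware and firmware rather than in a lookup table like this one.

```python
class ResourceGuard:
    """Hypothetical ownership table consulted before a partition may control a resource."""

    def __init__(self) -> None:
        self._owner: dict[str, str] = {}  # resource id -> owning partition

    def allocate(self, resource: str, partition: str) -> None:
        current = self._owner.get(resource)
        if current is not None and current != partition:
            raise PermissionError(f"{resource} is already allocated to {current}")
        self._owner[resource] = partition

    def check_access(self, partition: str, resource: str) -> None:
        """Raise unless `resource` was allocated to `partition`."""
        if self._owner.get(resource) != partition:
            raise PermissionError(f"{partition} may not control {resource}")

# Example: lpar2 tries to control an I/O slot that was given to lpar1.
guard = ResourceGuard()
guard.allocate("io-slot-3", "lpar1")
guard.check_access("lpar1", "io-slot-3")  # allowed, no exception
try:
    guard.check_access("lpar2", "io-slot-3")
except PermissionError as err:
    print(err)  # lpar2 may not control io-slot-3
```

Denying by default (an unknown resource has no owner, so every access to it fails) mirrors the rule that an image may control only what was explicitly allocated to it.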
One means of separating the partitions is a firmware component, such as the hypervisor within an RS/6000 platform, a product of International Business Machines Corporation of Armonk, N.Y. Hardware errors that are fatal to this firmware component are therefore fatal to the entire platform, bringing down the whole system. One major hardware error that may affect the hypervisor is a data fetch unrecoverable memory error (DfetchUE). Memory in the RS/6000 (RISC System/6000) is protected by a single-bit error correction code: the hardware can correct any single-bit error using redundant check bits. Multi-bit errors, however, currently cannot be corrected; they can only be detected. Although rare, multi-bit errors occur due to a variety of conditions. Therefore, a method, system, and apparatus for recovering from and isolating errors that affect the hypervisor is desirable.
SUMMARY OF THE INVENTION
The present invention provides a method, system, and apparatus for isolating fatal data fetch errors to a single partition within a logically partitioned data processing system. In one embodiment, the logically partitioned data processing system includes a plurality of operating systems and a plurality of processors. Each of the operating systems is assigned to a separate one of a plurality of logical partitions, and each of the processors is assigned to one of the logical partitions. The system also includes a hypervisor for creating and maintaining separation of the logical partitions. The hypervisor contains services and functions accessed by each of the logical partitions. To prevent fatal data fetch errors in one partition from affecting other partitions, the hypervisor includes a plurality of data structure areas. A fatal data fetch error occurring in one of these data structure areas results in rebooting only the data processing system components associated with the single affected logical partition.
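The containment idea in the summary can be sketched as a lookup from the failing address to the data structure area that holds it. `DataArea`, `Hypervisor`, and `on_fatal_data_fetch_error` are hypothetical names, and treating an address outside every per-partition area as platform-fatal is an assumption of this sketch, consistent with the background's description of unisolated hypervisor errors.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DataArea:
    """Hypothetical hypervisor memory area holding one partition's data structures."""
    partition: str
    start: int  # inclusive
    end: int    # exclusive

class Hypervisor:
    def __init__(self, areas: list[DataArea]) -> None:
        self.areas = areas
        self.rebooted: list[str] = []
        self.platform_down = False

    def owner_of(self, address: int) -> Optional[str]:
        for area in self.areas:
            if area.start <= address < area.end:
                return area.partition
        return None

    def on_fatal_data_fetch_error(self, address: int) -> None:
        """Contain an uncorrectable fetch error to the partition whose area failed."""
        owner = self.owner_of(address)
        if owner is not None:
            # Only the affected partition's components are rebooted.
            self.rebooted.append(owner)
        else:
            # Assumption: an error outside every per-partition area cannot be
            # isolated, so it remains fatal to the whole platform.
            self.platform_down = True

hv = Hypervisor([DataArea("lpar1", 0x1000, 0x2000), DataArea("lpar2", 0x2000, 0x3000)])
hv.on_fatal_data_fetch_error(0x2468)
print(hv.rebooted, hv.platform_down)  # ['lpar2'] False
```

Keeping each partition's hypervisor state in its own area is what makes the address a sufficient key for deciding which partition to reboot.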


