Reexamination Certificate
2000-04-29
2004-01-27
Myers, Paul R. (Department: 2181)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C714S010000, C714S011000, C710S306000, C710S063000, C713S152000
active
06684343
ABSTRACT:
TECHNICAL FIELD
This invention relates in general to multi-processor computer systems, and in particular to a service processor that supports a multi-processor computer system.
BACKGROUND
Prior computer platforms have been symmetric multiprocessor (SMP) arrangements in which multiple CPUs run a single copy of the operating system (OS). The OS provides time sharing services to allow multiple applications to run. However, this arrangement permits the applications to interfere with each other. For example, if the system is running an accounting application, the accounting application can allocate all the memory in the system, as well as use all the processors that the OS can allocate. Then some other application, for example a manufacturing application, would not be able to allocate any memory or processors for its needs, and therefore would freeze. Thus, the manufacturing application would have been frozen or impacted by the accounting application. This arrangement also leaves the system vulnerable to failures. Any problem with one application would corrupt the resources for all applications.
A known solution to this problem is to separate the computer system into partitions. These partitions are hardware separations which place resources into separate functional blocks. Resources can be flexibly assigned to partitions. Resources in one block do not have direct access to resources in another block. This prevents one application from using the entire system resources, and also contains faults and errors. An example of such a system is the Sun Microsystems UE10K.
This solution presents its own problem, namely system support and management. The assignment of hardware resources to partitions is flexible, and yet the user needs to be able to monitor and control the operation of the multiple partitions and the hardware assigned to them, to perform operational, diagnostic, and debugging functions. These capabilities need to be provided regardless of the configuration of the partitions.
SUMMARY OF THE INVENTION
These and other objects, features and technical advantages are achieved by a system and method which allows for management of the partitions and the hardware that they run on.
Several terms are defined in this paragraph which are necessary to understand the concepts underlying the present invention. A complex is a grouping of one or more cabinets containing cell boards and I/O, each of which can be assigned to a partition. Partitions are groupings of cell boards, with each partition comprising at least one cell. Each partition would run its own copy of system firmware and the OS. Each cell board can comprise one or more system CPUs together with system memory. Each cell board can optionally have I/O connected to it. Each partition must have at least enough I/O attached to its cell board(s) to be able to boot the OS. I/O (Input/Output subsystem) comprises an I/O backplane into which I/O controllers (e.g. PCI cards) can be installed, and the I/O controllers themselves. Cell boards in each cabinet are plugged into a backplane that connects them to the fabric. The fabric is a set of ASICs that allow the cells in a partition to communicate with one another, potentially across cabinet boundaries.
Cell boards are connected to I/O controllers in such a way that software or firmware running on a partition can operate the I/O controllers to transfer data between system memory and external disks, networks, and other I/O devices. One particular type of I/O controller is special—the Core I/O controller—which provides the console interface for the partition. Every partition must have at least one Core I/O controller installed. A complex has exactly one service processor.
Thus, with a multiple partition system, multiple copies of the OS are running independently of each other, each in a partition that has its own cell boards with processors and memory and connected I/O. This provides isolation between different applications. Consequently, a fatal error in one partition would not affect the other partitions.
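The terms defined above can be pictured as a small data model. The following Python sketch is illustrative only and is not taken from the patent: the class names, field names, and the `problems` helper are hypothetical, and the rule that a cell board belongs to at most one partition is an assumption drawn from the statement that each partition has its own cell boards.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """A cell board: one or more CPUs plus memory (field names are hypothetical)."""
    cell_id: int
    cpus: int
    memory_gb: int
    has_core_io: bool = False  # True if a Core I/O controller is attached


@dataclass
class Partition:
    """A named grouping of cell boards running its own firmware and OS copy."""
    name: str
    cells: list = field(default_factory=list)

    def problems(self):
        """Return the rules from the text that this partition violates."""
        issues = []
        if not self.cells:
            issues.append(f"{self.name}: needs at least one cell")
        if not any(cell.has_core_io for cell in self.cells):
            issues.append(f"{self.name}: needs at least one Core I/O controller")
        return issues


@dataclass
class Complex:
    """One or more cabinets of cells, managed by exactly one service processor."""
    partitions: list = field(default_factory=list)

    def problems(self):
        issues = [msg for part in self.partitions for msg in part.problems()]
        owner = {}  # cell_id -> name of the partition that already holds it
        for part in self.partitions:
            for cell in part.cells:
                # Assumption: a cell board is assigned to at most one partition.
                if cell.cell_id in owner:
                    issues.append(
                        f"cell {cell.cell_id}: in both {owner[cell.cell_id]} and {part.name}"
                    )
                owner[cell.cell_id] = part.name
        return issues
```

Calling `Complex.problems()` on a proposed layout returns an empty list when every partition has at least one cell and a Core I/O controller and no cell is shared.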
A service processor is used to manage the partitions and the hardware they run on. For certain operations, an external system (e.g. a workstation or a PC) augments the service processor, and works with the service processor to provide certain diagnostic features, namely firmware update and scan diagnostic functions. The user interacts with the service processor and/or the external system to perform the management and control functions.
A network of micro-controllers connected to the service processor, via a communications link, provides the service processor with information on each of the different cells, as well as a pathway to command changes in the different cells or I/O.
Therefore it is a technical advantage of the present invention to provide access to system control features such as power on/off, status display, etc., for multiple cabinets under control of the service processor.
It is another technical advantage of the present invention to allow commands which reference partitions (e.g. reset) to act on the collection of cells that form the partition. Partitions are referenced by the partition name and the service processor replicates commands such as reset to all the affected cells.
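A minimal sketch of this replication, assuming the service processor keeps a mapping from partition name to cell identifiers. The function name `replicate_command`, the dictionary layout, and the `send` callback are hypothetical stand-ins for whatever mechanism actually delivers a command to a cell.

```python
def replicate_command(command, partition_name, partitions, send):
    """Send `command` to every cell in the named partition.

    partitions: dict mapping partition name -> list of cell ids (assumed layout).
    send: callable(cell_id, command) that delivers the command to one cell.
    Returns the list of cell ids that were addressed.
    """
    if partition_name not in partitions:
        raise KeyError(f"unknown partition: {partition_name}")
    cells = list(partitions[partition_name])
    for cell_id in cells:
        send(cell_id, command)  # one copy of the command per affected cell
    return cells
```

The caller names only the partition; the fan-out to individual cells stays inside the function, which mirrors the idea that the user never has to address cells one by one.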
It is a further technical advantage of the present invention to provide security features to limit access to the system to authorized users.
It is a still further technical advantage of the present invention to prevent misconfiguring the system such that the power requirements of installed HW would exceed the capacity of the installed power supplies.
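The check implied here can be sketched as a comparison of total demand against total supply capacity. This is an assumption about the simplest possible form of the check: the function name `power_check`, the use of watts, and the example figures are hypothetical, and any redundancy or margin policy is left out.

```python
def power_check(component_watts, supply_watts):
    """Compare the demand of installed hardware with installed supply capacity.

    component_watts: power draw of each installed component (hypothetical values).
    supply_watts: rated capacity of each installed power supply.
    Returns (fits, demand, capacity); fits is False when demand exceeds capacity.
    """
    demand = sum(component_watts)
    capacity = sum(supply_watts)
    return demand <= capacity, demand, capacity
```

A configuration change would be refused whenever the first element of the result is `False`.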
It is a still further technical advantage of the present invention to report the power & environmental status of the complex comprising multiple cabinets.
It is a still further technical advantage of the present invention to provide JTAG scan capability for multiple cabinets from a single network drop which connects to an external workstation which runs the scan diagnostic.
It is a still further technical advantage of the present invention to provide live display of log events, which can be optionally filtered to include only those from a selected partition.
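The optional filtering can be illustrated with a small generator. The event format (a dictionary with a `partition` key) and the function name `filter_events` are hypothetical; the source does not describe how log events are represented.

```python
def filter_events(events, partition=None):
    """Yield log events, optionally restricted to one partition.

    events: iterable of dicts with at least a "partition" key (assumed format).
    partition: partition name to keep, or None to pass every event through.
    """
    for event in events:
        if partition is None or event["partition"] == partition:
            yield event
```

Because it is a generator, it can sit in front of a live stream and pass matching events through as they arrive.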
It is a still further technical advantage of the present invention to provide a live display showing the boot or run state of all the partitions and of all the cells.
It is a still further technical advantage of the present invention to receive log events from multiple partitions and store them in non-volatile memory.
It is a still further technical advantage of the present invention to reflect log events from all partitions back to the partitions for storage.
It is a still further technical advantage of the present invention to provide OS and system FW debugging capability without requiring additional HW.
It is a still further technical advantage of the present invention to provide a method to update both utilities FW and system FW. System FW must be able to be updated even when the cell is not part of a partition and is unbootable.
It is a still further technical advantage of the present invention to provide all the above features using a low-cost embedded service processor.
Bouchier Paul H.
Delmonte Janis
Gilbert Jr. Ronald E.
Hasty Robert Alan
Koerber Christine
Hewlett-Packard Development Company LP.
Myers Paul R.
Phan Raymond N