Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1999-10-19
2003-05-27
Beausoliel, Robert (Department: 2184)
C714S040000
active
06571360
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to the field of multiprocessor computer systems and, more particularly, to the testing of an I/O board that is connected to a multiprocessing computer system.
2. Description of the Related Art
Multiprocessor computer systems include two or more processors that may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device that executes programmed instructions to produce desired output signals, often in response to user-provided input data.
A popular architecture in commercial multiprocessor computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system comprises multiple processors each connected through a cache hierarchy to a shared bus. Additionally connected to the shared bus is a memory, which is shared among the processors in the system. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).
Another architecture for multiprocessor computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes that each include one or more processors and some local memory. The multiple nodes are coupled together by a network. The memory included within the multiple nodes, when considered as a collective whole, forms the shared memory for the computer system.
Distributed shared memory systems are more scalable than systems with a shared bus architecture. Since many of the processor accesses are completed within a node, nodes typically impose much lower bandwidth requirements upon the network than the same number of processors would impose on a shared bus. The nodes may operate at high clock frequency and bandwidth, accessing the network only as needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.
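The bandwidth argument above can be illustrated with a rough back-of-the-envelope sketch. Everything in it is an assumption made for illustration: the function names, the processor count, the access rate, and the fraction of accesses satisfied within a node are hypothetical, and caches are ignored for simplicity.

```python
# Illustrative comparison of interconnect traffic; every number below is assumed.

def shared_bus_traffic(num_processors, accesses_per_sec):
    """On a shared bus, every memory access from every processor crosses the bus."""
    return num_processors * accesses_per_sec


def network_traffic(num_processors, accesses_per_sec, local_fraction):
    """In a distributed shared memory system, only remote accesses reach the network."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return num_processors * accesses_per_sec * (1.0 - local_fraction)


# Hypothetical workload: 16 processors, one million accesses per second each,
# with three quarters of the accesses completed inside the local node.
bus = shared_bus_traffic(16, 1_000_000)
net = network_traffic(16, 1_000_000, 0.75)
print(bus, net)
```

Under these assumed figures the network carries a quarter of the traffic the shared bus would carry, which is the sense in which nodes impose lower bandwidth requirements on the network.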
Because of their high performance, multiprocessor computer systems are used for many different types of mission-critical applications in the commercial marketplace. For these systems, downtime can have a dramatic and adverse impact on revenue. Thus, system designs must meet the uptime demands of such mission-critical applications by providing computing platforms that are reliable, available for use when needed, and easy to diagnose and service.
One way to meet the uptime demands of these kinds of systems is to design in fault tolerance, redundancy, and reliability from the inception of the machine design. Reliability features incorporated in most multiprocessor computer systems include environmental monitoring, error correction code (ECC) data protection, and modular subsystem design. More advanced fault-tolerant multiprocessor systems also have several additional features, such as full hardware redundancy, fault-tolerant power and cooling subsystems, automatic recovery after a power outage, and advanced system monitoring tools.
For mission-critical applications such as transaction processing, decision support systems, communications services, data warehousing, and file serving, no hardware failure in the system should halt processing and bring the whole system down. Ideally, any failure should be transparent to users of the computer system and quickly isolated by the system. The system administrator must be informed of the failure so remedial action can be taken to bring the computer system back up to 100% operational status. Preferably, the remedial action can be made without bringing the system down.
In many modern multiprocessor systems, fault tolerance is provided by identifying and shutting down faulty processors and assigning their tasks to other functional processors. However, faults are not limited to processors and may occur in other portions of the system such as, e.g., interconnection traces and connector pins. While these are easily tested when the system powers up, testing for faults while the system is running presents a much greater challenge. This may be a particularly crucial issue in systems that are “hot-swappable”, i.e., systems that allow boards to be removed and replaced during normal operation so that the system remains available to users even while it is being repaired.
Examples of hardware components that can be hot-swapped in some systems include microprocessor boards, memory boards, and I/O boards. A microprocessor board may typically contain multiple microprocessors with supporting caches. I/O boards typically contain I/O ports for coupling the system to various peripherals. The I/O ports may take the form of expansion slots configured according to any one of many different busing standards such as PCI, SBus or EISA.
Ideally, as one of these hardware components is installed in the multiprocessing computer system, the component should be automatically tested to detect faults before being inducted into the system. Hardware components that include a microprocessor can be configured to do an automatic self-test. However, typical I/O boards do not include microprocessors, and it is undesirable to add a microprocessor to an I/O board solely for the purpose of performing an automatic self-test.
SUMMARY OF THE INVENTION
The problems outlined above are in large part solved by a multiprocessing computer system employing an error cage for dynamic reconfiguration testing of an I/O board as it is being connected. In one embodiment, the multiprocessing computer system provides the hardware support to properly test an I/O board while the system is running user application programs. The hardware support also prevents a faulty board from causing a complete system crash. The multiprocessor computer system includes a centerplane that mounts multiple expander boards. Each expander board in turn connects a processor board and an I/O board to the centerplane.
During operation of the multiprocessing computer system, a hot-swapped I/O board is first electrically coupled to the system as it is inserted into the expander board. The I/O board is then tested, and if it passes, incorporated logically into the running multiprocessor computer system and allowed to execute the operating system and application programs for users.
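The insertion sequence described above can be pictured as a small state progression. The following is only a minimal sketch: the state names, the `hot_swap` function, and the `run_tests` callable are hypothetical labels introduced here for illustration, not terms or interfaces from the patent.

```python
from enum import Enum, auto


class BoardState(Enum):
    ABSENT = auto()                # board not yet inserted
    ELECTRICALLY_COUPLED = auto()  # inserted into the expander board
    UNDER_TEST = auto()            # being tested before induction
    ACTIVE = auto()                # logically incorporated into the running system
    REJECTED = auto()              # failed testing, not incorporated


def hot_swap(run_tests):
    """Walk one board through insertion, test, and induction.

    run_tests is a hypothetical callable returning True when the board passes.
    Returns the list of states visited, in order.
    """
    history = [BoardState.ABSENT]
    history.append(BoardState.ELECTRICALLY_COUPLED)
    history.append(BoardState.UNDER_TEST)
    history.append(BoardState.ACTIVE if run_tests() else BoardState.REJECTED)
    return history


# A board that passes ends up ACTIVE; one that fails ends up REJECTED.
print([s.name for s in hot_swap(lambda: True)])
print([s.name for s in hot_swap(lambda: False)])
```

The point of the ordering is that electrical coupling always precedes testing, and logical incorporation happens only after a passing test.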
In one embodiment, testing of the I/O board proceeds in the following manner. Prior to testing, the I/O board is logically connected to the target domain as it would be for a dynamic reconfiguration attach. A process using a microprocessor and memory on a microprocessor board performs hardware testing of the I/O board. A hardware failure cage, an address transaction cage, and an interrupt transaction cage isolate any errors generated while the I/O board is being tested. The hardware failure cage isolates error correction code errors, parity errors, protocol errors, timeout errors and other similar errors generated by the I/O board under test. The address transaction cage isolates out-of-range memory addresses from the I/O board under test. The interrupt transaction cage isolates incorrect interrupt requests generated by the I/O board under test. After testing is complete, the I/O board is logically disconnected from the domain as it would be for a dynamic reconfiguration detach. Any errors generated by the I/O board while being tested are logged by the hardware for possible retrieval by the system controller.
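The three cages can be thought of as filters placed between the board under test and the rest of the system. The sketch below is a software analogy only, since the patent describes hardware support. The `ErrorCage` class, the `(kind, value)` event tuples, the address range, and the interrupt target set are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorCage:
    allowed_addresses: range        # memory the board under test may touch (assumed)
    allowed_interrupt_targets: set  # processors it may interrupt (assumed)
    log: list = field(default_factory=list)

    def admit(self, event):
        """Return True if the event may pass into the rest of the system."""
        kind, value = event
        if kind == "error":
            # ECC, parity, protocol, timeout, and similar errors
            reason = "hardware failure cage"
        elif kind == "address" and value not in self.allowed_addresses:
            reason = "address transaction cage"
        elif kind == "interrupt" and value not in self.allowed_interrupt_targets:
            reason = "interrupt transaction cage"
        else:
            return True
        # Isolated events are logged for later retrieval rather than propagated.
        self.log.append((reason, kind, value))
        return False


def run_board_test(cage, events):
    """Feed the board's events through the cage; report pass/fail and the log."""
    results = [cage.admit(e) for e in events]
    return all(results), list(cage.log)


cage = ErrorCage(range(0x1000, 0x2000), {0})
passed, log = run_board_test(cage, [
    ("address", 0x1800),   # in range, admitted
    ("address", 0x9000),   # out of range, isolated
    ("interrupt", 3),      # not an allowed target, isolated
    ("error", "parity"),   # hardware error, isolated
])
print(passed, len(log))
```

In this analogy the log plays the role of the hardware error log that the system controller may later retrieve, and an isolated event never reaches the rest of the system.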
The preferred system and method prevent a faulty I/O board from causing errors in the isolated portion of the system, thereby shielding ongoing user applications and preventing any system crashes that might result from propagation of in
Drogichen Daniel P.
Graf Eric Eugene
Kane Don
Meyer Douglas B.
Phelps Andrew E.
Beausoliel Robert
Bonzo Bryce P.
Kivlin B. Noël
Meyertons Hood Kivlin Kowert & Goetzel P.C.
Sun Microsystems Inc.
Cage for dynamic attach testing of I/O boards