Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1998-06-15
2001-07-10
Trammell, James P. (Department: 2161)
C714S043000, C714S056000, C710S120000
active
06260159
ABSTRACT:
BACKGROUND OF THE INVENTION
This invention relates to a multi-processor computer system including first and second processing sets (each of which may comprise one or more processors) which communicate with an I/O device bus.
The invention finds particular application in fault tolerant computer systems where two or more processing sets need to communicate with an I/O device bus in lockstep, with provision for identifying lockstep errors in order to identify faulty operation of the system as a whole.
In such a fault tolerant computer system, an aim is not only to be able to identify faults, but also to provide a structure which is able to provide a high degree of system availability. In order to provide high levels of system availability, it would be desirable for such systems to automatically attempt recovery from a lockstep error.
As part of such an automatic recovery process it is necessary to reintegrate the state of the processing sets to a common status in order to attempt a restart in lockstep. An approach to achieving this is to copy the complete state of one of the processing sets (i.e. the “good” one) to the other processing set. This involves ensuring that the content of the memory of both processors is the same before trying a restart in lockstep mode.
However, a problem with copying the content of the memory from one processing set to the other is that, while the copy is in progress, devices connected to the I/O bus may be performing direct memory access (DMA) operations on the memory of the processing set(s). If a write is made to an area of memory which has already been copied, the memory state in the processing sets at the end of the copy would not be the same.
It has been proposed to employ a dirty RAM in a processor to indicate areas of memory which have been changed since the dirty RAM was last reset. A dirty RAM is a bit map having a bit for each block, or page, of memory, which bit is set when a write access to the area of memory concerned is made. However, the provision of a dirty RAM in the processing sets would not provide a reliable solution to the problem of reinstating the memory of the processor because of the difficulties and delays in accessing the dirty RAM of other processing sets.
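The dirty RAM described above can be pictured as a small bitmap data structure. The following is a minimal software sketch only, not the hardware design of the invention; the class name `DirtyBitmap`, its method names and the 4096-byte page size are assumptions introduced for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (not stated in the source)


class DirtyBitmap:
    """One bit per memory page; a set bit means the page was written since the last reset."""

    def __init__(self, num_pages: int) -> None:
        self.num_pages = num_pages
        self.bits = bytearray((num_pages + 7) // 8)  # 8 page bits per byte

    def mark_write(self, address: int) -> None:
        """Set the bit for the page containing `address`."""
        page = address // PAGE_SIZE
        if not 0 <= page < self.num_pages:
            raise IndexError(f"address {address:#x} is outside tracked memory")
        self.bits[page // 8] |= 1 << (page % 8)

    def is_dirty(self, page: int) -> bool:
        return bool(self.bits[page // 8] & (1 << (page % 8)))

    def dirty_pages(self) -> list[int]:
        """Return the indices of all pages written since the last reset."""
        return [p for p in range(self.num_pages) if self.is_dirty(p)]

    def reset(self) -> None:
        """Clear every bit, as when the dirty RAM is reset."""
        for i in range(len(self.bits)):
            self.bits[i] = 0


bitmap = DirtyBitmap(num_pages=16)
bitmap.mark_write(0x0000)  # falls in page 0
bitmap.mark_write(0x2FFF)  # last byte of page 2
bitmap.mark_write(0x2000)  # page 2 again; the bit is already set
print(bitmap.dirty_pages())  # [0, 2]
```

Repeated writes to the same page cost nothing extra, which is why a single bit per page is sufficient to decide what must be copied again.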
An aim of the present invention is to provide a solution to the problem of addressing direct memory accesses in achieving reinstatement of a concurrent state in first and second processing sets.
SUMMARY OF THE INVENTION
Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Features from the dependent claims may be combined with features of the independent claims as appropriate and not merely as explicitly set out in the claims.
In accordance with one aspect of the invention, there is provided a bridge for a multi-processor system. The bridge comprises bus interfaces for connection to an I/O bus of a first processing set, an I/O bus of a second processing set and a device bus. A bridge control mechanism is operable to permit direct memory access to memory of the processing sets by a device on the device bus, to arbitrate between the first and the second processing sets for access to the bridge in a first, split, mode, and to monitor lockstep operation of the first and second processing sets in a second, combined, mode. A dirty RAM mechanism is provided in the bridge for monitoring regions of processing set memory modified by direct memory accesses by the device on the device bus.
An embodiment of the invention is thus able to monitor parts of memory modified by DMA operations initiated by a device on the device bus. Providing the dirty RAM mechanism in the bridge facilitates access to the dirty RAM by the processing sets. The re-integration process can involve a number of passes, during each of which dirtied memory is copied from a good processing set to a faulty (target) processing set or sets. During the process of re-integration the good processing set can access the dirty RAM to determine the parts of the memory which have been dirtied (in either its own or the target processing set's memory) and which are to be copied on any pass.
It should be noted that the bus interfaces referenced above need not be separate components of the bridge, but may be incorporated in other components of the bridge, and may indeed be simply connections for the lines of the buses concerned.
In an embodiment of the invention, the dirty RAM mechanism defines a dirty indicator (e.g., a bit) for each of a plurality of regions of processing set memory, a dirty indicator being set to a predetermined value when the region of memory has been written to by a DMA access.
The processing sets can be configured such that one of the processing sets is operable in the split mode as a primary processing set and to copy the content of its memory to the other processing set(s). If during this copy operation some of the regions of the memory are written to by a direct memory access, the state at the end of the copy operation will not be the same in the various processing sets. As a result the primary processing set re-copies those regions of its memory which have been marked in the dirty RAM mechanism as having been written to by virtue of the corresponding dirty indication being set. This process can be repeated in a number of passes as required.
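The multi-pass copy described above can be sketched as a small simulation. This is an illustration under stated assumptions rather than the claimed mechanism: memory is modelled as a list of regions, the dirty RAM as a Python set, DMA writes are assumed to land in the primary processing set's memory only, and the names `reintegrate` and `dma_schedule` are hypothetical.

```python
def reintegrate(primary, target, dirty, dma_schedule):
    """Copy `primary` to `target` region by region, re-copying dirtied regions.

    primary, target: lists of region contents (same length).
    dirty: set of region indices marked by DMA writes (stands in for the
        dirty RAM in the bridge).
    dma_schedule: iterator of (region, value) DMA writes; one is applied after
        each region copy to simulate a device writing during the copy.
    Returns the number of passes made.
    """
    def dma_write():
        write = next(dma_schedule, None)
        if write is not None:
            region, value = write
            primary[region] = value  # device writes into the primary's memory
            dirty.add(region)        # the write is recorded as a dirty indication

    to_copy = list(range(len(primary)))  # the first pass copies every region
    passes = 0
    while to_copy:
        passes += 1
        for region in to_copy:
            target[region] = primary[region]
            dma_write()
        # Read and clear the dirty indications; they define the next pass.
        to_copy = sorted(dirty)
        dirty.clear()
    return passes


primary = [10, 11, 12, 13]
target = [0, 0, 0, 0]
passes = reintegrate(primary, target, set(), iter([(0, 99), (3, 77)]))
print(passes, target)  # 2 [99, 11, 12, 77]
```

In this run region 0 is overwritten after it has been copied, so the first pass leaves the two memories different, and the second pass repairs it. Region 3 is also re-copied even though its write arrived before it was first copied; a dirty indication is conservative in that sense.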
In an embodiment of the invention, the bridge control mechanism comprises an arbiter connected to the first and second processor bus interfaces and to the device bus interface, the arbiter being configured to be operable in the split mode to arbitrate for use of the bridge by the first and second processing sets and devices on the device bus. The bridge control mechanism is configured to be operable to respond to a synchronization reset operation from the primary processing set, on completion of copying the content of the memory regions identified in the dirty RAM mechanism with no further regions having been so identified, to transfer from the split mode of operation to the combined mode of operation.
The dirty RAM mechanism can comprise a dirty RAM configured in random access memory in the bridge. Alternatively, a separate hardware memory device may be provided. The content of the dirty RAM can be cleared on being read by a processing set. Alternatively, two dirty RAMs can be provided, the two dirty RAMs being operable in a toggle mode with one being written to while the other is being read. Optionally, a respective dirty RAM could be provided for each processing set.
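The toggle-mode alternative mentioned above, with one dirty RAM being written while the other is read, resembles double buffering. A minimal sketch follows, assuming a software model with two flag lists; the class `ToggledDirtyRam` and its method names are hypothetical and do not come from the source.

```python
class ToggledDirtyRam:
    """Two dirty RAM banks: one records writes while the other is read and cleared."""

    def __init__(self, num_regions):
        self.banks = [[False] * num_regions, [False] * num_regions]
        self.active = 0  # index of the bank currently recording writes

    def record_write(self, region):
        """Mark `region` dirty in the bank that is currently recording."""
        self.banks[self.active][region] = True

    def swap_and_read(self):
        """Switch recording to the other bank, then read and clear the retired bank."""
        retired = self.active
        self.active ^= 1  # new writes now land in the other bank
        dirty = [r for r, flag in enumerate(self.banks[retired]) if flag]
        self.banks[retired] = [False] * len(self.banks[retired])
        return dirty


ram = ToggledDirtyRam(num_regions=8)
ram.record_write(1)
ram.record_write(5)
first = ram.swap_and_read()   # regions dirtied before the swap
ram.record_write(2)           # recorded in the other bank
second = ram.swap_and_read()
third = ram.swap_and_read()
print(first, second, third)   # [1, 5] [2] []
```

Because recording is switched before the retired bank is read, a write that arrives during the read is captured in the other bank rather than lost, which is the apparent motivation for the toggle arrangement.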
There may be more than two processor bus interfaces for connection to I/O buses of respective processing sets.
In accordance with another aspect of the invention, there is provided a computer system comprising a first processing set having an I/O bus, a second processing set having an I/O bus, a device bus, at least one device on the device bus and a bridge as set out above. Each processing set may comprise at least one processor, memory and a processing set I/O bus controller.
In accordance with a further aspect of the invention, there is provided a method of operating a multi-processor system as set out above, the method comprising:
permitting direct memory access to memory of the processing sets by the at least one device on the device bus; and
monitoring, in a dirty RAM in the bridge, regions of processor set memory written to by the device on the device bus.
A method of re-integration can involve multiple passes of copying areas of memory from a first processing set to a second processing set, the areas to be copied being identified as the areas of memory for which the corresponding dirty RAM bit is set.
The re-integration method can include a step of preventing direct memory access in order to restart in a combined, or lockstep, mode.
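The final stage of re-integration, in which direct memory access is prevented before the restart in combined mode, can be sketched as a tiny state model. This is an assumption-laden illustration: the class `BridgeModel`, the `dma_enabled` flag, and the rule that `sync_reset` succeeds only when DMA is blocked and no region is dirty are simplifications chosen here, not details confirmed by the source.

```python
from enum import Enum


class Mode(Enum):
    SPLIT = "split"
    COMBINED = "combined"


class BridgeModel:
    """Hypothetical model of the bridge's mode, DMA gate and dirty indications."""

    def __init__(self):
        self.mode = Mode.SPLIT
        self.dma_enabled = True
        self.dirty = set()

    def dma_write(self, region):
        """Return True if the write is accepted; a blocked write dirties nothing."""
        if not self.dma_enabled:
            return False
        self.dirty.add(region)
        return True

    def read_and_clear_dirty(self):
        """Return the dirty regions for the next copy pass and clear them."""
        regions = sorted(self.dirty)
        self.dirty.clear()
        return regions

    def sync_reset(self):
        """Enter combined mode only if DMA is blocked and nothing is dirty."""
        if self.dma_enabled or self.dirty:
            return False
        self.mode = Mode.COMBINED
        return True


bridge = BridgeModel()
bridge.dma_write(4)
early = bridge.sync_reset()                 # refused: DMA still enabled, region 4 dirty
bridge.dma_enabled = False                  # prevent further direct memory access
final_pass = bridge.read_and_clear_dirty()  # regions to copy in the last pass
refused = bridge.dma_write(7)               # blocked, so no new dirty indication
switched = bridge.sync_reset()
print(early, final_pass, refused, switched, bridge.mode.value)
```

With DMA gated, the last pass cannot be invalidated by a new write, so the copy is guaranteed to leave both memories identical before lockstep operation resumes.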
Garnett Paul J.
Oyelakin Femi A.
Rowlinson Stephen
Conley Rose & Tayon PC
Elisca Pierre Eddy
Kivlin B. Noël
Sun Microsystems Inc.