Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2000-04-29
2003-11-25
Beausoliel, Robert (Department: 2184)
C714S019000, C707S793000
active
06654908
ABSTRACT:
TECHNICAL FIELD
This application relates in general to computer systems and more specifically to error registers shared and accessed by multiple requestors in a multiprocessor system.
BACKGROUND
Many computer systems use multiple processors to reach solutions faster or to address more complex problems. A typical state-of-the-art multiprocessor system is described, for example, in U.S. Pat. No. 6,049,801 entitled “Method of and Apparatus for Checking Cache Coherency in a Computer Architecture”, and U.S. Pat. No. 5,859,975 entitled “Parallel Processing Computer System Having Shared Coherent Memory and Interconnections Utilizing Separate Unidirectional Request and Response Lines for Direct Communication of Using Crossbar Switching Device”. Both patents are assigned to the owner of the present invention and are incorporated herein in their entirety. A multiprocessor computing system as described therein contains several compute elements, each of which includes at least one processor and may include dedicated or shared, distributed or central memory and input/output (I/O). These compute elements are connected to each other via an intercommunication fabric. The intercommunication fabric allows the various compute elements to exchange messages, share data, and coordinate processing. When an error occurs in this intercommunication fabric, the error is detected and recorded in an error log register located in the intercommunication fabric.
It is important that the information contained in the error log register is forwarded to the user of the multiprocessor system. However, retrieval and display of this information are complicated by a number of factors. First, a dedicated error register reading compute element may not be practical because not all errors may be visible to each of the compute elements, and compute elements may be added or removed from the system during operation. Second, compute elements in a system are unaware of each other until they make contact via the intercommunication fabric, and the error itself may disrupt or prevent communications between the various compute elements. Third, errors themselves occur with varying frequency, and a specific error log only contains information concerning a limited number of errors, typically only a single error. Fourth, an error register is typically sized to contain information relating to a single error, and successive error information is lost until the error register is read by a compute element and made ready to store subsequent error events. Each compute element is therefore interested in reporting errors as quickly as possible. Conflicts between competing compute elements to read and make error register content accessible are inevitable.
Normally the error log register cannot be read in a single access by any of the compute elements; that is, the operation is non-atomic and requires several read cycles. A compute element must therefore retrieve all of the information in the error log register through multiple accesses. Normally a flag or a status register indicates that an error has been captured and stored in the error log register. Once the status register has been set, a compute element begins to access the information in the error log register and continues accessing that information until all of the error information has been retrieved. Once all of the information has been retrieved, the compute element then clears the status flag. However, in a multiprocessor environment wherein the error log register is shared, problems develop when compute elements compete to read the information stored in the error log register.
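The conventional flag-based read described above can be sketched as follows. This is a minimal illustrative model, not the patent's hardware: the class `ErrorLog`, its four-word width, and the function `read_log` are hypothetical names introduced here, and the register is assumed to drop new errors while the status flag is still set.

```python
class ErrorLog:
    """Hypothetical model of an error log register guarded by a status flag."""

    def __init__(self, width=4):
        self.words = [0] * width   # the log is wider than one access
        self.status = False        # set when an error has been captured

    def capture(self, words):
        """Record an error; assumed to be dropped if the log is still occupied."""
        if self.status:
            return False
        self.words = list(words)
        self.status = True
        return True

    def read_word(self, index):
        """One read cycle returns one word, so a full read is non-atomic."""
        return self.words[index]

    def clear(self):
        """Clear the status flag so a later error can be captured."""
        self.status = False


def read_log(log):
    """Read every word in separate accesses, then clear the status flag."""
    if not log.status:
        return None
    copy = [log.read_word(i) for i in range(len(log.words))]
    log.clear()
    return copy
```

With a single reader this sequence is safe; the difficulty arises only when several compute elements run `read_log` against the same register at overlapping times.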
Such contention problems may come about as follows. If compute element A detects that the status flag is set, it begins to read the information from the error log register. Subsequently compute element B may also detect that the status flag is set. Compute element B would then begin to read the information stored in the error log register. Normally compute element A would complete its reading of the information stored in the error log register and clear the status register before compute element B has completed its reading of the error log register. Upon completion of compute element B's reading of the error log register, compute element B would notice the status register was no longer set and would discard the information. However, if a second error should occur after compute element A clears the flag and before compute element B completes its reading of the information in the error log register, compute element B's retrieved information would then contain part of the log of the first error and part of the log of the second error and would be invalid. Even though compute element B would check the status register to ensure the data is valid, the status register would have been set again by the second error and compute element B would believe that this information was valid. Compute element B obtains the invalid log because compute element A cleared the original error and a second error occurred before compute element B completed its retrieval of the error information. Compute element B would then pass invalid information to the user.
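The interleaving described above can be replayed deterministically. The sketch below is illustrative only: the function name, the four-word log, and the word labels such as `E1-w0` are assumptions made for the example, and the steps are executed in one thread in the order the text gives.

```python
def simulate_torn_read():
    """Replay the A/B interleaving step by step in a single thread."""
    # First error captured: four words plus a status flag.
    words = ["E1-w0", "E1-w1", "E1-w2", "E1-w3"]
    status = True

    # Compute element B sees the flag and reads the first half of the log.
    b_copy = [words[0], words[1]]

    # Compute element A reads the whole log and clears the flag.
    a_copy = list(words)
    status = False

    # A second error is captured before B has finished reading.
    words = ["E2-w0", "E2-w1", "E2-w2", "E2-w3"]
    status = True

    # B reads the remaining words, then checks the flag for validity.
    b_copy += [words[2], words[3]]
    b_accepts = status  # the flag is set again, so B trusts its copy

    return a_copy, b_copy, b_accepts
```

In this replay, A holds a clean copy of the first error, while B holds two words from each error and still passes its validity check, because the flag alone cannot distinguish the first error from the second.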
A prior method of solving this problem used a hardware semaphore to coordinate the retrieval of information from the error log registers between compute element A and compute element B. A hardware semaphore can be configured to ensure that only one compute element accesses the information stored in the error log register at a time. However, the use of hardware semaphores has several disadvantages. One such disadvantage is that, after a compute element coordinates with a hardware semaphore to access an error log register, the compute element may begin to access the error log register and then encounter an error so that it cannot complete its access. As long as that compute element retains control of the hardware semaphore, no other compute element can access the error log register in question. An additional mechanism would then be required to recover the lost semaphore so that the error log register information could be read and passed to the user.
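The lost-semaphore problem can be illustrated with a small software model. This is a sketch under stated assumptions: `SemaphoreGuardedLog`, `read_all`, `fail_at`, and `demo_lost_semaphore` are hypothetical names, a `threading.Lock` stands in for the hardware semaphore, and a raised exception stands in for a compute element failing mid-read.

```python
import threading


class SemaphoreGuardedLog:
    """Hypothetical error log whose reads are serialized by a semaphore."""

    def __init__(self, words):
        self.words = list(words)
        self.semaphore = threading.Lock()  # stand-in for a hardware semaphore

    def read_all(self, fail_at=None):
        """Return a copy of the log, or None if the semaphore is already held."""
        if not self.semaphore.acquire(blocking=False):
            return None
        copy = []
        for index, word in enumerate(self.words):
            if index == fail_at:
                # Models a compute element that fails mid-read. The release
                # below is never reached, so the semaphore stays held.
                raise RuntimeError("compute element failed mid-read")
            copy.append(word)
        self.semaphore.release()
        return copy


def demo_lost_semaphore():
    """Element A fails while holding the semaphore; element B then tries."""
    log = SemaphoreGuardedLog(["w0", "w1", "w2", "w3"])
    try:
        log.read_all(fail_at=2)   # element A
    except RuntimeError:
        pass
    return log.read_all()         # element B is locked out
```

Here `demo_lost_semaphore()` returns `None`: the second reader can never obtain the log. Ordinary software would release the lock in a `with` block, but that option is deliberately not used, since a failed compute element cannot be relied on to run cleanup code.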
A second method of coordinating access to the error log register by multiple compute elements uses a communication mechanism between the processors to coordinate the reading and clearing of error log registers. In a multiple compute element environment, with the compute elements communicating via the intercommunication fabric, this methodology is impractical because the error log register resides in the intercommunication fabric and an error may make the intercommunication fabric itself unavailable to support communications between compute elements.
A need therefore exists for a method and system which allow multiple compute elements to read and independently clear error register logs and to discard invalid data, and which ensure that the user receives the information recorded in the error log registers. A further need exists for a protocol which ensures that the error log register is not cleared until its information is successfully retrieved by a compute element and which does not allow erroneous data to be accessed and used.
SUMMARY OF THE INVENTION
These and other objects, features and technical advantages are achieved by a system and method which, according to one aspect of the invention, provide a token to ensure that related data is not altered or cleared during a reading of the data by another process. The token can be read atomically and uniquely identifies a log entry that is to be read but that cannot itself be read atomically and evaluated for change. The token may be implemented in the form of a counter corresponding to the log entry. The log entry may only be cleared using the token as a key. Error data may be stored as the log entry using the token as the key so that only previously read data is overwritten. Reading may also be performed using the log so that intervening processes cannot alter the data. This method may be used to ensure that only valid copies of error data are obtained. Accor
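One way to picture the token scheme summarized above is the sketch below. It is an interpretation under stated assumptions, not the patented implementation: the class `TokenedErrorLog`, its methods, the `valid` bit, and the choice to increment the counter on each captured error are all hypothetical. It shows only the idea that a clear succeeds when the presented token matches the current log entry.

```python
class TokenedErrorLog:
    """Hypothetical error log whose entry can only be cleared with its token."""

    def __init__(self, width=4):
        self.words = [0] * width
        self.valid = False
        self.token = 0   # counter; assumed to advance with each captured entry

    def capture(self, words):
        """Store a new entry only if the previous one has been cleared."""
        if self.valid:
            return False
        self.token += 1
        self.words = list(words)
        self.valid = True
        return True

    def read_token(self):
        """Single-access read of the token, or None if no entry is logged."""
        return self.token if self.valid else None

    def read_word(self, index):
        return self.words[index]

    def clear(self, token):
        """Clear the entry only when the caller's token matches it."""
        if self.valid and token == self.token:
            self.valid = False
            return True
        return False


def demo_token_protocol():
    """Replay the two-reader race with token-keyed clearing."""
    log = TokenedErrorLog()
    log.capture(["E1-w0", "E1-w1", "E1-w2", "E1-w3"])

    b_token = log.read_token()                      # B starts reading
    b_copy = [log.read_word(0), log.read_word(1)]

    a_token = log.read_token()                      # A reads it all and clears
    a_copy = [log.read_word(i) for i in range(4)]
    a_cleared = log.clear(a_token)

    log.capture(["E2-w0", "E2-w1", "E2-w2", "E2-w3"])  # second error arrives

    b_copy += [log.read_word(2), log.read_word(3)]  # B finishes a torn read
    b_cleared = log.clear(b_token)                  # stale token is rejected
    if not b_cleared:
        b_copy = None                               # B discards its copy

    return a_copy, a_cleared, b_copy, b_cleared, log.read_token()
```

In this replay the stale token both tells compute element B that its copy is invalid and leaves the second error in place for a later reader, without requiring any communication between the two compute elements.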
Lindsay Dean T.
Snyder Robert D.