Special encoding of known bad data

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C714S054000

Reexamination Certificate

active

06662319

ABSTRACT:

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
Not applicable.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to a multi-processor system. More particularly, the invention relates to the detection of corrupted data in a multi-processor system. More particularly still, the invention relates to the detection of corrupted data and replacement of the corrupted data with a predetermined value to indicate to the rest of the system that a transmission error has occurred that has already been detected.
2. Background of the Invention
It often is desirable to include multiple processors in a single computer system. This is especially true for computationally intensive applications and applications that otherwise can benefit from having more than one processor simultaneously performing various tasks. It is not uncommon for a multi-processor system to have 2 or 4 or more processors working in concert with one another. Typically, each processor couples to at least one and perhaps three or four other processors.
Such systems usually require data and commands (e.g., read requests, write requests, etc.) to be transmitted from one processor to another. For the data or commands to pass from the source to the destination, the transmission may have to pass through one or more intervening processors interconnecting the source and the destination processors. Accordingly, messages can be passed from one processor to another and another with the intervening processors simply forwarding the message on to the next processor in the communication path.
A desirable feature of such systems is to be able to detect the presence of corrupted data and, if possible, correct the corrupted data. A data packet might have one of its bits reverse logic state (i.e., switch from a 0 to a 1 or 1 to a 0) at some point between the source and the destination. Further, more than one bit in a data packet might improperly change state. A single bit in a data packet that becomes corrupted is referred to as a “single bit error” and more than one bit becoming corrupted is a “multi-bit error.” There are a variety of causes of such corruption. For example, cosmic radiation can change the state of individual gates, causing a bit to change state. Further, it is possible for electromagnetic interference generated by nearby electronics to affect the electrical state of gates in a multi-processor system. Regardless of the source of the data corruption, it is desirable to be able to detect that the corruption has occurred and, if possible, correct the problem.
A variety of error detection schemes have been suggested and used. Some techniques are capable of only detecting single bit errors, while other techniques can detect double-bit errors. Further, some techniques also include error correction to permit the corrupted bit or bits to be corrected. Such error correction techniques generally require not only detecting that an error has occurred, but also identifying which bit or bits are erroneous. Some systems will be able to detect that a multi-bit error has occurred, but not be able to determine which bits are erroneous and thus be unable to correct the problem. There is a tradeoff between the capabilities of an error detection and correction scheme and its complexity. For instance, single bit detection and correction schemes are generally less complex than multi-bit error detection and correction schemes but cannot correct more than one corrupted bit at a time.
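The distinction between a correctable and a detectable-but-uncorrectable error can be illustrated with a small sketch. The text does not name a particular code; the example below assumes an extended Hamming (8,4) code, a textbook single-error-correcting, double-error-detecting scheme, purely for illustration. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds an overall parity bit; indices 1..7 form a Hamming(7,4)
    codeword with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(code) % 2  # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error: the syndrome is the position of the bad bit
        # (0 means the overall parity bit itself was hit).
        code = list(code)
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        # The error is detected, but the bad bits cannot be identified.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

With this code, flipping any one bit of a codeword yields `"corrected"` and the original data, while flipping any two bits yields `"uncorrectable"`, which is the class of error the remainder of the discussion is concerned with.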
Whatever type of error detection and correction scheme is chosen for implementing in a given multi-processor system, a problem still remains as to what to do with those errors that can be detected, but not corrected. In conventional systems, there generally have been two choices. On one hand, the message containing the detected, but uncorrectable, error can be halted and not retransmitted to the next processor in the communication path. This approach advantageously isolates the error, but can cause the system to “deadlock,” meaning that the system generally becomes unusable. Deadlock can occur when future tasks that the processors are to perform are contingent upon a particular data message. If that message is stopped due to a corrupted bit or bits, the system will not be able to determine what action to perform next.
Alternatively, the message with the corrupted bit can be forwarded on to the next processor in the communication link. Deadlock is avoided in this case, as the message is sent. However, each processor that receives the message will detect the error and signal an error event (typically by asserting an error flag). For a message with an error that passes through 10 processors, all 10 processors will signal an error. With 10 processors all indicating the same error, error isolation becomes problematic. That is, determining the source of the error becomes difficult, if not impossible.
Accordingly, a need exists to efficiently and effectively handle errors in a multi-processor system that can detect, but not necessarily correct, the error. Such a system should be able to detect the error, preclude the system from becoming deadlocked and permit the error to be efficiently isolated. To date, no such system is known to exist.
BRIEF SUMMARY OF THE INVENTION
The problems noted above are solved in large part by a multi-processor system in which each processor can receive a message from one or more other processors in the system. The message may contain corrupted data that was corrupted during transmission from the preceding processor. Upon receiving the message, the processor detects that a portion of the message contains corrupted data. The processor then replaces the corrupted portion with a predetermined bit pattern that is known to or otherwise programmed into all other processors in the system. The predetermined bit pattern indicates that a data transmission error has occurred in the corresponding portion of the message. The processor that detects the error in the message preferably alerts the system, for example by setting an error flag, that an error has been detected. The message now containing the predetermined bit pattern in place of the corrupted data can be retransmitted to another processor. The predetermined bit pattern will indicate that an error in the message was detected by the previous processor. In response, the processor detecting the predetermined bit pattern preferably will not alert the system of the existence of an error. The same message with the predetermined bit pattern then can be retransmitted to other processors which also will detect the presence of the predetermined bit pattern and in response not alert the system of the presence of the error. As such, because only the first processor to detect an error alerts the system of the error and because messages containing uncorrectable errors still are transmitted through the system, fault isolation is improved and the system is less likely to fall into a deadlock condition.
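The behavior described above can be sketched as a small simulation. This is a toy model under stated assumptions, not the patented implementation: each unit of data is 8 data bits plus one even-parity bit (detection only), `POISON` is a hypothetical reserved value chosen here so that it can never be a valid unit, and the names `Router`, `make_tick`, `parity_ok` and `simulate` are illustrative.

```python
POISON = 0x1FF  # hypothetical reserved 9-bit pattern; odd parity, so never valid data


def make_tick(byte):
    """Append an even-parity bit (bit 8) to 8 data bits."""
    parity = bin(byte).count("1") & 1
    return byte | (parity << 8)


def parity_ok(tick):
    """True when the 9-bit unit has an even number of 1 bits."""
    return bin(tick).count("1") % 2 == 0


class Router:
    def __init__(self, use_poison):
        self.use_poison = use_poison
        self.error_flag = False

    def forward(self, message):
        out = []
        for tick in message:
            if self.use_poison and tick == POISON:
                out.append(tick)  # already reported upstream: pass on silently
            elif parity_ok(tick):
                out.append(tick)
            else:
                self.error_flag = True  # this router detected the error
                out.append(POISON if self.use_poison else tick)
        return out


def simulate(use_poison, hops=4, bad_link=1):
    """Send a message through `hops` routers, corrupting one unit on the
    link into router number `bad_link`. Returns (error flags, delivered message)."""
    routers = [Router(use_poison) for _ in range(hops)]
    message = [make_tick(b) for b in b"data"]
    for i, router in enumerate(routers):
        if i == bad_link:
            message[2] ^= 0x04  # single bit flipped in transit
        message = router.forward(message)
    return [r.error_flag for r in routers], message
```

Calling `simulate(True)` leaves only the router immediately downstream of the faulty link with its error flag set, and the message is still delivered with the reserved pattern in place of the bad unit. Calling `simulate(False)` models conventional forwarding, in which every router after the fault raises its flag and the faulty link can no longer be singled out. One limitation of this toy model is that a corrupted unit could by chance equal the reserved pattern; the sketch does not address that case.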
Each processor preferably includes a memory controller for connection to a memory device, an interface to an input/output controller, a router for connection to one or more other processors, and other components. The router transmits and receives messages to and from other processors in the system. The router also detects transmission errors and replaces the erroneous portion with the predetermined bit pattern.
Each message preferably includes multiple “ticks” of data with each tick comprising multiple bits of information including error check bits. The error check bits permit the router to detect transmission errors and may permit correction of the erroneous bits. Some types of errors, however, are uncorrectable given the number of error check bits. These uncorrectable errors can be detected but cannot be corrected. Upon detecting an uncorrectable error in a tick, the router replaces all of the bits in the corrupted tick with the predetermined bit pattern. Data ticks include multiple data bits and multiple error check bits. An exemplary predetermined bit patte
