Error detection and correction method in a computer system...

Error detection/correction and fault detection/recovery – Pulse or data error handling – Digital data error correction

Reexamination Certificate

active

06779148

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates to a method of detecting a failure of a computer system, and to a main memory controller of computer systems. In particular, this invention relates to a technology that is effectively applied to an error detection and correction method, a main memory controller for computer systems, and a computer system preferably used to avoid a system failure derived from occurrence of an error and to specify an error source.
As a method of avoiding a system failure when an uncorrectable error is detected in data to be written in a main memory over a CPU bus or an I/O bus, Japanese Patent Laid-open No. 6-89196, for example, discloses the approach described below. Namely, when a main memory controller detects an uncorrectable error in data transferred over a CPU bus or an I/O bus, the received data is rewritten into data having a specific pattern. The check bits produced from the specific pattern data are all inverted. Consequently, data that has all of its check bits inverted and is encoded according to a specific error correcting code is written in the main memory. When the data is read from the main memory, if the calculated syndrome exhibits an all-1 bit pattern and the data has the specific pattern, the data is judged to be data struck with an uncorrectable error over the CPU bus or I/O bus. Consequently, fault information can be recorded without increasing the number of interface signals to the main memory needed to store fault information, and without increasing the storage capacity of the main memory.
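The marking and detection steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the circuit of the cited publication: a small (8, 4) SEC-DED code is assumed, and the matrix `H_COLS`, the all-zero `SPECIFIC_PATTERN`, and all function names are hypothetical.

```python
# Hypothetical (8, 4) SEC-DED parity-check matrix, one column per bit in the
# order [d0 d1 d2 d3 c0 c1 c2 c3]. Assumed for illustration only.
H_COLS = [
    (1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1),  # data bits
    (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1),  # check bits
]
ALL_ONES = (1, 1, 1, 1)
SPECIFIC_PATTERN = [0, 0, 0, 0]  # assumed; the source does not give the pattern


def syndrome(word):
    """XOR together the columns of H selected by the 1 bits of word."""
    s = (0, 0, 0, 0)
    for bit, col in zip(word, H_COLS):
        if bit:
            s = tuple(a ^ b for a, b in zip(s, col))
    return s


def encode(data):
    """Append check bits so that the syndrome of the whole word is zero.

    Works because the check-bit columns of this H form an identity matrix.
    """
    return list(data) + list(syndrome(list(data) + [0, 0, 0, 0]))


def mark_uncorrectable():
    """Encode the specific pattern, then invert all check bits."""
    word = encode(SPECIFIC_PATTERN)
    return word[:4] + [b ^ 1 for b in word[4:]]


def is_marked(word):
    """Read-side test: all-1 syndrome and the specific data pattern."""
    return syndrome(word) == ALL_ONES and list(word[:4]) == SPECIFIC_PATTERN


marked = mark_uncorrectable()
print(marked, is_marked(marked))       # [0, 0, 0, 0, 1, 1, 1, 1] True
print(is_marked(encode([1, 0, 1, 0]))) # False
```

Inverting every check bit adds the all-1 vector to the syndrome of an otherwise valid word, which is why the read-side test looks for an all-1 syndrome.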
Moreover, even in a case where a fault recovery means for retrying an instruction transferred over a CPU bus or an I/O bus is not included, re-booting is not performed when a fault is detected. Only when a CPU attempts to read the above-mentioned data is an interrupt issued to the CPU in order to report that a fault has been detected. Even when fault-stricken data is written in the main memory, as long as the CPU does not attempt to read the data, the fault in the data does not lead to a system failure (a system halt, re-booting, or any other failure directly recognized by a user). This contributes to improvement of system availability.
The present inventor has studied the aforesaid method of constructing the code proposed in the prior art. As a result, the three drawbacks described below have become apparent: (1) when check bits are inverted, the syndrome calculated for received data exhibits an all-1 bit pattern and, consequently, corresponds to a multi-bit error pattern whose occurrence frequency is low; (2) if another one-bit error occurs in the main memory, the data encoded according to the specific error correcting code may be wrongly corrected; (3) since the data is rewritten to have a specific pattern, the original pattern of the data cannot be referenced. These drawbacks will be further described by taking examples.
To begin with, some preliminaries needed to describe the drawbacks (1) and (2) will be given. For brevity, a single-bit error correcting/double-bit error detecting code (SEC-DED code) will be taken as an example. The SEC-DED code is defined, as shown in FIG. 16, such that the code length is eight bits and the check bit length is four bits. For a description of the code, refer to “Error-Control Coding for Computer Systems” (p. 140) written by T. R. N. Rao and E. Fujiwara. FIG. 16 shows an example of a parity-check matrix H (hereinafter, matrix H) and an example of the arrangement of information bits and check bits. The column vectors of the matrix H are referred to as h0, h1, . . . , h7.
FIG. 17 shows an example of the drawback (1). Assuming that a two-bit error occurs at the bit positions d0 and c3 shown in FIG. 16, an all-1 bit pattern is produced as the syndrome S. Depending on the way a code is constructed, even if the syndrome produced exhibits an all-1 bit pattern, a multi-bit error whose occurrence frequency is low is not detected as an error.
Referring to FIG. 18, the drawback (2) will be described. As shown in (1) of FIG. 18, an encoded word is [00000000], and the check bits are all inverted according to the conventional method, whereby data d=[00001111] is produced. Thereafter, a one-bit error occurs as shown in (2) of FIG. 18. The data struck with the error is [00001110]. A syndrome for the data is, as shown in (3) of FIG. 18, calculated using the matrix H shown in FIG. 16. The syndrome corresponds to the column vector h0 in the matrix H shown in FIG. 16. Consequently, it is judged that a one-bit error has occurred at the bit position d0. Eventually, the data is wrongly corrected into [10001110].
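The two examples above can be replayed numerically. FIG. 16 is not reproduced here, so the matrix below is a hypothetical (8, 4) SEC-DED matrix chosen only to agree with the two facts the text states (an error at d0 and c3 gives an all-1 syndrome, and the syndrome of [00001110] equals h0); the actual matrix in the figure may differ, and the function names are illustrative.

```python
# Hypothetical parity-check matrix: columns h0..h7 for the bit order
# [d0 d1 d2 d3 c0 c1 c2 c3]. Assumed, not taken from FIG. 16.
H_COLS = [
    (1, 1, 1, 0),  # h0 (d0)
    (1, 1, 0, 1),  # h1 (d1)
    (1, 0, 1, 1),  # h2 (d2)
    (0, 1, 1, 1),  # h3 (d3)
    (1, 0, 0, 0),  # h4 (c0)
    (0, 1, 0, 0),  # h5 (c1)
    (0, 0, 1, 0),  # h6 (c2)
    (0, 0, 0, 1),  # h7 (c3)
]


def syndrome(word):
    """XOR together the columns of H selected by the 1 bits of word."""
    s = (0, 0, 0, 0)
    for bit, col in zip(word, H_COLS):
        if bit:
            s = tuple(a ^ b for a, b in zip(s, col))
    return s


def correct_single(word):
    """Flip the bit whose column equals the syndrome, if there is one."""
    s = syndrome(word)
    fixed = list(word)
    if s in H_COLS:
        fixed[H_COLS.index(s)] ^= 1
    return fixed


# Drawback (1): a two-bit error at d0 and c3 yields the all-1 syndrome.
print(syndrome([1, 0, 0, 0, 0, 0, 0, 1]))   # (1, 1, 1, 1)

# Drawback (2): the marked word [00001111] suffers a one-bit error at c3.
received = [0, 0, 0, 0, 1, 1, 1, 0]
print(syndrome(received))                    # (1, 1, 1, 0), equal to h0
print(correct_single(received))              # [1, 0, 0, 0, 1, 1, 1, 0]
```

The last line reproduces the wrongly corrected word [10001110] from the text: the decoder cannot tell a marked word with one flipped check bit from a valid word with a one-bit error at d0.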
When the conventional method is applied to an error control code generally implemented in computer systems, the drawbacks (1) and (2) may arise. Therefore, the conventional method cannot be applied to all error control codes but only to an error control code that employs a matrix H of a specific bit pattern. However, the related art does not state what kind of code the method can be applied to.
Next, the drawback (3) will be described below. Inspection of several patterns of data in which an error was detected shows that data in which a specific bit is struck with a stuck-at-zero error may be produced. If the patterns of such fault-stricken data are kept, they may help to analyze the cause of the error. Therefore, if the patterns of fault-stricken data are discarded, it takes much time to analyze the cause of the error, thereby increasing the Mean-Time-To-Repair (MTTR).
SUMMARY OF THE INVENTION
An object of the present invention is to provide an error detection and correction method capable of encoding data so as to keep, as fault information, the result of detecting an uncorrectable error in input data without changing the number of bits constituting the encoded word, and of storing the resultant data in a main memory. Moreover, this method can avoid a situation in which the data is wrongly corrected when the encoded data is decoded because of a failure to reproduce the fault information.
Another object of the present invention is to provide an error detection and correction method that does not discard the pattern of fault-stricken data and does not hinder analysis of the cause of an error.
Still another object of the present invention is to provide an error detection and correction method making it possible to accomplish the above objects without greatly modifying a known encoding circuit or decoding circuit.
These and other objects of the present invention and novel features thereof will be apparent from the description of this specification and the appended drawings.
The representative aspects of the present invention disclosed in this specification will be outlined below.
To begin with, the gist of the present invention will be described using the error control code described in conjunction with FIG. 16. The SEC-DED code shown in FIG. 16 is defined such that the code length is eight bits and the information bit length is four bits. The SEC-DED code may be referred to as an (8, 4) SEC-DED code. The maximum code length of a SEC-DED code in which the number of check bits is four is known to be eight bits, as described on page 139 of the above-mentioned literature. When the number of information bits that must be protected by an error control code is 2, the column vectors associated with bit positions unallocated to information bits are deleted from the matrix H as shown in FIG. 1, one column vector for each unused information bit. A SEC-DED code employing the resultant matrix is therefore a (6, 2) SEC-DED code. A code from which column vectors have been removed in this way, so that data is encoded with only the bit positions of the necessary information bits associated with column vectors of the matrix H, is referred to as a shortened code. The underlying idea of the present invention is that fault information is allocated to the bit positions associated with the deleted column vectors.
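The shortening step can be illustrated as follows. This sketch only shows how a (6, 2) code is obtained from an (8, 4) code by deleting columns; it does not attempt to reproduce how the invention encodes fault information. The matrix `FULL_COLS`, the choice of deleted positions in `DELETED`, and the function names are assumptions, since FIG. 1 and FIG. 16 are not reproduced here.

```python
# Hypothetical (8, 4) SEC-DED parity-check matrix, one column per bit in the
# order [d0 d1 d2 d3 c0 c1 c2 c3]. Assumed for illustration only.
FULL_COLS = [
    (1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1),  # data bits
    (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1),  # check bits
]
DELETED = [2, 3]  # assumed: the two unused information-bit positions

# Shortened (6, 2) matrix: the same columns with the unused ones removed.
SHORT_COLS = [c for i, c in enumerate(FULL_COLS) if i not in DELETED]


def syndrome(word, cols):
    """XOR together the columns selected by the 1 bits of word."""
    s = (0, 0, 0, 0)
    for bit, col in zip(word, cols):
        if bit:
            s = tuple(a ^ b for a, b in zip(s, col))
    return s


def encode(data, cols):
    """Append four check bits; valid because the last four columns are identity."""
    check = syndrome(data, cols[:len(data)])
    return list(data) + list(check)


short_word = encode([1, 1], SHORT_COLS)        # (6, 2) codeword
full_word = encode([1, 1, 0, 0], FULL_COLS)    # (8, 4) codeword, unused bits 0

print(short_word)                              # [1, 1, 0, 0, 1, 1]
print(full_word)                               # [1, 1, 0, 0, 0, 0, 1, 1]
print(syndrome(short_word, SHORT_COLS))        # (0, 0, 0, 0)
```

Under these assumptions the shortened codeword has the same check bits as the full codeword whose deleted positions are fixed at 0, which is what leaves those positions free to carry other information.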
In an example shown in FIG. 2, bits of fault information e0 and e1 are all
