Detecting and mitigating memory device latchup in a data...

Error detection/correction and fault detection/recovery – Pulse or data error handling – Memory testing

Reexamination Certificate

Details

C714S764000

active

06799288

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to systems and methods for detecting and responding to errors and failures in a memory device, and particularly, for such systems and methods in space applications.
2. Description of the Related Art
Computer memory and other semiconductor components are susceptible to environmental effects which can cause them to fail. One class of failures occurs as a result of exposure to radiation. The environmental conditions for space applications present radiation which produces this class of failures. Such radiation can be devastating to a satellite lacking adequate safeguards. When cosmic radiation passes through a sensitive semiconductor component in a satellite, one of three possible conditions may result.
The first condition is a single-event upset (SEU). In a microprocessor or RAM chip, an SEU occurs when the contents of a particular memory address or register are inverted (e.g. a bit flips from 0 to 1). As a result, sensor data can be corrupted, algorithms can fail, and the satellite firmware can be adversely affected. A corrupted program could attempt to execute random code, or data in the memory may be lost.
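As a minimal illustration of the upset described above, the following Python sketch inverts one bit of a stored word and shows that a single parity bit is enough to notice the change. The function names, the 8-bit word, and the use of parity are assumptions made for illustration; the text does not specify them.

```python
def flip_bit(word: int, bit: int) -> int:
    """Return the word with one bit position inverted, as an SEU would leave it."""
    return word ^ (1 << bit)


def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


# Hypothetical 8-bit memory word and the parity bit stored alongside it.
stored = 0b10110010
stored_parity = parity(stored)

# A single-event upset inverts bit 3 (0 becomes 1).
upset = flip_bit(stored, 3)

# The parity no longer matches, so the upset is detectable (but not locatable).
upset_detected = parity(upset) != stored_parity
```

Parity only detects an odd number of flipped bits and cannot say which bit changed, which is why correction requires a stronger code.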
The second condition is a single-event latchup (SEL). In this case, the affected component latches into a state where it draws a dangerously high current until the power to the device is reset. If the current is not limited, the system power supply may also fail, or its voltage may dip below acceptable levels for normal system operation, affecting many other major onboard systems. Also, if the device is not rated for the high current, it may be destroyed.
The third condition induced by cosmic radiation is a single-event burnout (SEB). In this case, the affected device is destroyed immediately following exposure. Unlike SEUs and SELs (where the device is not destroyed and may be reset), the only adequate response to an SEB is to invoke a redundant device.
Furthermore, different semiconductor devices have different susceptibilities to radiation-induced failures. Some device designs may reduce (or virtually eliminate) the risk of a radiation-induced failure; however, it is often not reasonable to apply such techniques to every semiconductor device. In general, the higher the capacity of a memory device, the more susceptible it is to failures, including latchup. Thus, very high capacity memory devices, e.g. 64 Mbit devices, have a relatively high susceptibility. Therefore, systems and methods to protect these devices are especially important.
FIG. 1 is a block diagram of a typical prior art system 100 for latchup detection and mitigation. The system 100 includes hardware detection and reset components entirely separate from the software and other operations of the computer system 102 which it monitors. The monitored computer system 102 includes a central processing unit (CPU) 104, one or more memory devices 106, such as silicon-based SDRAM, and input/output devices 108 which are used to monitor and control various subsystems. The CPU 104 utilizes the memory 106, comprising one or more memory devices 106A, 106B, to store program data and information which are being processed and used by the computer 102. Program data and information are transferred between the CPU 104 and memory 106 via the data bus 110 as the computer 102 operates.
The latchup detection and mitigation system 100 operates by monitoring the current consumption of the memory 106 via links 112. Harmful radiation 114 may impinge on at least one of the memory devices 106A, causing a single event latchup (SEL) in the memory device 106A. As a result, the latched up memory device 106A begins to draw an excessive amount of current from the memory power supply 116. The current measurement hardware 118 continually monitors the current drawn by the memory devices 106 from the power supply 116 and relays the information to the threshold detection hardware 120. When an unsafe threshold is reached by any of the memory devices 106, the detection hardware 120 signals a reset to the power supply for at least the affected memory device 106A. For simple processor designs in which the power supply powers both the memories and the processor, the power supply reset will shut down power to the entire processor 102.
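The threshold comparison performed by the prior art detection hardware can be modeled in a few lines of Python. This is only a software model of a hardware function, and the device labels, current readings, and limit shown are hypothetical values chosen for illustration, not figures from the text.

```python
from typing import Dict, List


def devices_over_threshold(currents_ma: Dict[str, float], limit_ma: float) -> List[str]:
    """Return the names of memory devices whose measured current exceeds the limit.

    currents_ma maps a device label to its measured supply current in mA.
    Both the labels and the limit are illustrative assumptions.
    """
    return sorted(name for name, ma in currents_ma.items() if ma > limit_ma)


# Hypothetical readings: device 106A has latched up, device 106B is normal.
readings = {"106A": 900.0, "106B": 120.0}
to_reset = devices_over_threshold(readings, limit_ma=500.0)
```

A model like this makes the limitation discussed next easy to see: a device that corrupts data without drawing excess current never appears in the returned list.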
The additional hardware adds to the cost and mass of the overall computer system 100. In addition, the hardware of the described system 100 increases the complexity and reduces the reliability of the computer system 102. Furthermore, this system 100 only detects and eliminates SELs that result in an excessive current draw which could damage or destroy hardware. It does not check for SEUs or other innocuous memory failures which do not result in a high current draw. Finally, because the system is hardware based, it is not easily or inexpensively altered to meet a change in requirements or to implement improvements.
There is a need for systems and methods which can detect and respond appropriately to single event failures of any type. If a memory device latches up so that it completely fails, power needs to be removed from it in a timely manner, even if that means immediately shutting down the entire processor. On the other hand, if the memory experiences an SEU, the system and method need to correct the error(s) without interrupting the functionality of the processor. Furthermore, there is a need for such systems and methods to function without requiring additional hardware components. There is also a need for such systems and methods to be inexpensive, reliable, light and easily modified. The present invention meets all of these needs.
SUMMARY OF THE INVENTION
The present invention discloses an apparatus, method and article of manufacture for detecting memory device failures. The exemplary method comprises detecting errors in data stored in a memory device from the data transacted with a processor, correcting the detected errors in the data transacted with the processor, tracking the detected errors in the memory device, determining when the memory device has failed based upon the tracked detected errors and resetting the memory device when the memory device fails testing. Errors can be corrected such that no erroneous data is transacted with the processor.
In one embodiment, the error detection and correction is carried out by a hardware logic device on the data bus, and the failure determination and resetting are performed by software.
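The detect, correct, and track steps can be sketched in Python as follows. The text does not name the error-correcting code used by the hardware logic device, so this sketch assumes a Hamming(7,4) single-error-correcting code as a stand-in; `encode`, `decode`, and `EdacMemory` are hypothetical names, and the 4-bit data width is chosen only to keep the example short.

```python
from typing import Dict, List, Tuple

DATA_POSITIONS = (3, 5, 6, 7)   # codeword positions that hold data bits
PARITY_POSITIONS = (1, 2, 4)    # codeword positions that hold parity bits


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as a Hamming(7,4) codeword (index 0 is unused)."""
    code = [0] * 8
    for shift, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> shift) & 1
    for p in PARITY_POSITIONS:
        # Parity bit p makes the parity even over all positions with bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    return code


def decode(code: List[int]) -> Tuple[int, bool]:
    """Return (data, corrected); a single flipped bit is repaired."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i           # XOR of the positions of all set bits
    fixed = list(code)
    if syndrome:
        fixed[syndrome] ^= 1        # a nonzero syndrome names the bad position
    nibble = sum(fixed[pos] << shift for shift, pos in enumerate(DATA_POSITIONS))
    return nibble, syndrome != 0


class EdacMemory:
    """Toy memory whose reads are corrected and whose corrections are counted."""

    def __init__(self) -> None:
        self.cells: Dict[int, List[int]] = {}
        self.corrections = 0        # tracked errors, available to software

    def write(self, addr: int, nibble: int) -> None:
        self.cells[addr] = encode(nibble)

    def read(self, addr: int) -> int:
        nibble, corrected = decode(self.cells[addr])
        if corrected:
            self.corrections += 1
        return nibble               # the caller only ever sees corrected data
```

In this arrangement the processor never receives erroneous data for a single-bit error, while the correction counter gives software the information it needs to decide whether the device has failed.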
The invention tracks how frequently error correction is required and uses this information to determine if the memory device has failed. When a memory device failure is determined, the invention resets the memory device by signaling a power supply of the memory device to cycle. Errors will appear as a result of ordinary data transactions between the processor and memory device as it operates. The invention also identifies erroneous latchups as latchups detected soon after powering. In this case the indicated latchup is ignored.
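A minimal Python sketch of this failure determination is given below. It assumes a sliding time window for measuring how frequently corrections occur and a fixed settling period after power-up during which an indicated latchup is ignored; the class name, parameter names, and threshold values are all hypothetical, since the text gives no specific figures.

```python
from collections import deque
from typing import Callable, Deque


class LatchupMonitor:
    """Deduce a latchup from the rate of corrected errors (illustrative only)."""

    def __init__(self, max_errors: int, window_s: float, settle_s: float,
                 power_cycle: Callable[[], None]) -> None:
        self.max_errors = max_errors    # corrections within the window that imply failure
        self.window_s = window_s        # length of the sliding window, in seconds
        self.settle_s = settle_s        # period after power-up in which latchups are ignored
        self.power_cycle = power_cycle  # callback that cycles the memory power supply
        self.powered_at = 0.0           # time of the most recent power-up
        self.events: Deque[float] = deque()

    def record_correction(self, now: float) -> bool:
        """Log one corrected error at time `now`; return True if a reset was issued."""
        self.events.append(now)
        # Discard corrections that have aged out of the window.
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        if len(self.events) < self.max_errors:
            return False
        if now - self.powered_at < self.settle_s:
            # Indicated latchup soon after powering: treated as erroneous and ignored.
            self.events.clear()
            return False
        self.power_cycle()
        self.powered_at = now
        self.events.clear()
        return True
```

Because the thresholds are ordinary software parameters, they can be tuned after deployment, which is one of the advantages the text claims over a hardware-only detector.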
In one embodiment, the invention also affirmatively tests the memory device, e.g. by periodically performing a write operation of test data to the memory device, followed by a read operation of the test data from the memory device. A failure of the memory device is determined based upon error correction required in response to the test (e.g. the read operation). However, errors in the test data are corrected such that no erroneous test data is transacted with the processor.
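The periodic write-then-read test might look like the following Python sketch. It assumes the memory is reached through two hypothetical callables: `write(address, value)` and `read(address)`, where `read` returns the corrected data together with a flag saying whether correction was needed. The test patterns and the returned tuple are likewise assumptions for illustration.

```python
from typing import Callable, Iterable, Tuple


def run_memory_test(write: Callable[[int, int], None],
                    read: Callable[[int], Tuple[int, bool]],
                    address: int,
                    patterns: Iterable[int] = (0x55, 0xAA)) -> Tuple[int, int]:
    """Write each test pattern, read it back, and tally the results.

    Returns (corrections, mismatches): how many reads needed error
    correction, and how many still returned wrong data after correction.
    """
    corrections = 0
    mismatches = 0
    for pattern in patterns:
        write(address, pattern)
        data, corrected = read(address)
        if corrected:
            corrections += 1
        if data != pattern:
            mismatches += 1
    return corrections, mismatches
```

Each correction reported by a test like this can be passed to whatever logic tracks correction frequency, so that a device that is rarely accessed in normal operation is still checked.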
The present invention responds to memory device errors (e.g. SEUs) as well as failures (e.g. SELs). The error correction logic monitors the overall “health” of the data stored within the memory device. This monitoring is facilitated through periodic testing (e.g. read/write operations). When error correction for a memory device becomes excessive, indicating a failure beyond the scope of a simple SEU, a failure is deduced and a memory reset is performed.


REFERENCES:
patent: 6560725 (2003-05-01), Longwell et al.
