Fault resilient/fault tolerant computing

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C714S025000

active

06205565

ABSTRACT:

BACKGROUND OF THE INVENTION
The invention relates to fault resilient and fault tolerant computing.
Fault resilient computer systems can continue to function in the presence of hardware failures. These systems operate in either an availability mode or an integrity mode, but not both. A system is “available” when a hardware failure does not cause unacceptable delays in user access. Accordingly, a system operating in an availability mode is configured to remain online, if possible, when faced with a hardware error. A system has data integrity when a hardware failure causes no data loss or corruption. Accordingly, a system operating in an integrity mode is configured to avoid data loss or corruption, even if the system must go offline to do so.
Fault tolerant systems stress both availability and integrity. A fault tolerant system remains available and retains data integrity when faced with a single hardware failure, and, under some circumstances, when faced with multiple hardware failures.
Disaster tolerant systems go one step beyond fault tolerant systems and require that loss of a computing site due to a natural or man-made disaster will not interrupt system availability or corrupt or lose data.
Typically, fault resilient/fault tolerant systems include several processors that may function as computing elements or controllers, or may serve other roles. In many instances, it is important to synchronize operation of the processors or the transmission of data between the processors.
SUMMARY OF THE INVENTION
In one aspect, generally, the invention features synchronizing data transfer to a computing element in a computer system including the computing element and controllers that provide data from data sources to the computing element. A request for data made by the computing element is intercepted and transmitted to the controllers. Controllers respond to the request and at least one controller responds by transmitting requested data to the computing element and by indicating how another controller will respond to the intercepted request.
Embodiments of the invention may include one or more of the following features. A controller may respond to the intercepted request by indicating that the controller has no data corresponding to the intercepted request and by indicating that another controller will respond to the intercepted request by transmitting data to the computing element. Each response to the intercepted request by a controller may include an indication as to how each other controller will respond to the intercepted request.
The computing element may compare the responses to the intercepted request for consistency. When each response includes an indication as to how each other controller will respond to the intercepted request, the comparison may include comparing the indications for consistency. When responses of two or more controllers include requested data, the comparison may include comparing the data for consistency. The computing element may notify the controllers of the outcome of the comparison and that responses have been received from all of the controllers.
A controller may be disabled when the responses are not consistent. In addition, an error condition may be generated if the computing element does not receive responses from all of the controllers within a predetermined time period.
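The consistency check described above can be sketched in Python. This is a minimal illustration, not the patented implementation: the `Response` type, the `check_responses` function, and the boolean encoding of "will send data" are all assumptions made for the example. A missing response stands in for the timeout case; real timer handling is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class Response:
    """Hypothetical controller response (names are illustrative only)."""
    sender: str
    data: Optional[bytes]          # None means "no data for this request"
    predictions: Dict[str, bool]   # other controller id -> will it send data?


def check_responses(responses: Dict[str, Response],
                    expected: Iterable[str]) -> Tuple[bool, str]:
    """Compare controller responses the way a computing element might."""
    # Error condition: not every controller answered (stands in for the timeout).
    missing = set(expected) - set(responses)
    if missing:
        return False, "missing responses from " + ", ".join(sorted(missing))

    # Each response implies a complete view of who sends data:
    # the sender's own behavior plus its indications about the others.
    views = []
    for r in responses.values():
        view = dict(r.predictions)
        view[r.sender] = r.data is not None
        views.append(view)
    if any(v != views[0] for v in views[1:]):
        return False, "indications disagree"

    # When two or more controllers supplied data, the data must match.
    payloads = {r.data for r in responses.values() if r.data is not None}
    if len(payloads) > 1:
        return False, "data mismatch"
    return True, "consistent"
```

A caller could treat any `False` result as the trigger for disabling a controller or raising the error condition; which controller to disable is outside the scope of this sketch.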
A data source may be associated with a controller, and the controller may obtain the requested data from the data source in response to the intercepted request.
A controller may maintain a record of a status of another controller, and may use the record when indicating how the other controller will respond to the intercepted request. When a data source is associated with the other controller, the record may include the status of the data source. Each controller may maintain records of statuses of all other controllers and may use the records to indicate how the other controllers will respond to the intercepted request. When each controller is associated with a data source, each controller may maintain records of statuses of data sources associated with all other controllers.
When a status of a data source associated with a controller changes, the controller may transmit to the computing element an instruction to discard responses from other controllers to the intercepted request. The computing element may respond to the instruction by discarding responses from other controllers to the intercepted request and by transmitting to the controllers a notification that the responses have been discarded. A controller may respond to the notification by updating a record of the status of the data source. After updating the record, the controller may retransmit the requested data to the computing element and indicate how the other controller will respond to the intercepted request.
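The discard-and-retransmit exchange above might look roughly like the following sketch. The class names, the `"responses_discarded"` message string, and the idea that the notification carries the changed controller's id and new status are all assumptions for illustration; the text does not specify message formats. The sketch also assumes the response from the controller that issued the instruction is kept.

```python
class ComputingElement:
    """Hypothetical computing element that buffers controller responses."""

    def __init__(self, controller_ids):
        self.controller_ids = list(controller_ids)
        self.responses = {}   # controller id -> buffered response
        self.outbox = []      # (destination controller, message) pairs

    def on_response(self, sender, response):
        self.responses[sender] = response

    def on_discard_instruction(self, sender):
        # Drop responses from every controller other than the one whose
        # data source changed status (assumption: its own response is kept).
        self.responses = {cid: r for cid, r in self.responses.items()
                          if cid == sender}
        # Tell all controllers that the responses have been discarded.
        for cid in self.controller_ids:
            self.outbox.append((cid, "responses_discarded"))


class Controller:
    """Hypothetical controller holding status records for data sources."""

    def __init__(self, cid, records):
        self.cid = cid
        self.records = dict(records)   # controller id -> data source online?

    def on_discard_notification(self, changed_cid, new_status):
        # Update the record first, then build the retransmitted indications.
        self.records[changed_cid] = new_status
        predictions = {c: up for c, up in self.records.items() if c != self.cid}
        return {"sender": self.cid, "predictions": predictions}
```

Updating the record before retransmitting means the new indications reflect the changed data source, so the retransmitted responses can pass the consistency comparison.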
When a data source is associated with each controller, each controller may respond to the intercepted request by determining whether an associated data source is expected to process the request, and when the associated data source is expected to process the request, transmitting the request to the associated data source, receiving results of the request from the associated data source, and forwarding the results of the request to the computing element. When the associated data source is not expected to process the request, the controller may respond by informing the computing element that no data will be provided in response to the request.
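A controller's side of this exchange can be sketched as follows. The `Controller` class, the `read_source` callback, and the dictionary layout of the response are hypothetical names chosen for the example, and the status records are reduced to a single boolean per controller.

```python
class Controller:
    """Hypothetical controller that answers an intercepted request."""

    def __init__(self, cid, source_status, read_source=None):
        self.cid = cid
        # Records of every controller's data source: True if that source
        # is expected to process requests (illustrative encoding).
        self.source_status = dict(source_status)
        # Callable that fetches data from the associated data source, if any.
        self.read_source = read_source

    def respond(self, request):
        # Indicate how each *other* controller is expected to respond.
        predictions = {c: up for c, up in self.source_status.items()
                       if c != self.cid}
        expected = self.source_status.get(self.cid, False)
        if expected and self.read_source is not None:
            # Forward the request to the data source and relay the results.
            data = self.read_source(request)
        else:
            # Inform the computing element that no data will be provided.
            data = None
        return {"sender": self.cid, "data": data, "predictions": predictions}


# Usage: controller A owns the live data source, controller B does not.
status = {"A": True, "B": False}
ctrl_a = Controller("A", status, read_source=lambda req: b"block-" + req.encode())
ctrl_b = Controller("B", status)
reply_a = ctrl_a.respond("7")
reply_b = ctrl_b.respond("7")
```

Here `reply_a` carries the data and says B will send none, while `reply_b` carries no data and says A will send some, so the two replies describe the same overall outcome.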
In another aspect, generally, the invention features maintaining synchronization between computing elements processing identical instruction streams in a computer system including the computing elements and controllers that provide data from data sources to the computing elements, with the controllers operating asynchronously to the computing elements. Computing elements processing identical instruction streams each stop processing of the instruction stream at a common point in the instruction stream. Each computing element then generates a freeze request message and transmits the freeze request message to the controllers. A controller receives a freeze request message from a computing element, waits for a freeze request message from other computing elements, and, upon receiving a freeze request message from each computing element processing an identical instruction stream, generates a freeze response message and transmits the freeze response message to the computing elements. Each computing element, upon receiving a freeze response message from a controller, waits for freeze response messages from other controllers to which a freeze request message was transmitted, and, upon receiving a freeze response message from each controller, generates a freeze release message, transmits the freeze release message to the controllers, and resumes processing of the instruction stream.
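The freeze request, freeze response, and freeze release handshake can be simulated in a few lines of Python. This is a single-threaded sketch under simplifying assumptions: messages are delivered immediately by direct method calls, and the class and function names (`Controller`, `ComputingElement`, `run_freeze_round`) are invented for the example.

```python
class Controller:
    def __init__(self, cid, element_ids):
        self.cid = cid
        self.element_ids = set(element_ids)
        self.requests_seen = set()

    def on_freeze_request(self, element_id):
        """Return True once every computing element has sent its request."""
        self.requests_seen.add(element_id)
        if self.requests_seen == self.element_ids:
            self.requests_seen = set()   # ready for the next round
            return True
        return False


class ComputingElement:
    def __init__(self, eid, controller_ids):
        self.eid = eid
        self.controller_ids = set(controller_ids)
        self.responses_seen = set()
        self.frozen = False

    def freeze(self):
        # Stop at the common point in the instruction stream.
        self.frozen = True
        self.responses_seen = set()

    def on_freeze_response(self, controller_id):
        """Return True when all controllers have responded (time to release)."""
        self.responses_seen.add(controller_id)
        if self.frozen and self.responses_seen == self.controller_ids:
            self.frozen = False          # resume the instruction stream
            return True
        return False


def run_freeze_round(elements, controllers):
    """Drive one handshake and return the ordered message log."""
    log = []
    for e in elements:
        e.freeze()
    for e in elements:
        for c in controllers:
            log.append(("freeze_request", e.eid, c.cid))
            if c.on_freeze_request(e.eid):
                for target in elements:
                    log.append(("freeze_response", c.cid, target.eid))
                    if target.on_freeze_response(c.cid):
                        for dest in controllers:
                            log.append(("freeze_release", target.eid, dest.cid))
    return log


elements = [ComputingElement(e, ["C1", "C2"]) for e in ("E1", "E2")]
controllers = [Controller(c, ["E1", "E2"]) for c in ("C1", "C2")]
log = run_freeze_round(elements, controllers)
```

In this run no freeze release is logged until both controllers have seen requests from both computing elements, which is the property that keeps the elements aligned at the common point.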
Embodiments of the invention may include one or more of the following features. The common point in the instruction stream may correspond to an I/O operation, the occurrence of a predetermined number of instructions without an I/O operation, or both.
A controller may include a time update in the freeze response message, and a computing element, upon receiving a freeze response message from each controller to which a freeze request message was transmitted, may update a system time using the time update from a freeze response message. The computing element may use the time update from a freeze response message generated by a particular controller.
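One way to picture the time update is the sketch below, in which every computing element adopts the time supplied by the same designated controller once all freeze responses have arrived. The `ElementClock` class, the `time_source` parameter, and the integer tick values are assumptions for illustration only.

```python
class ElementClock:
    """Hypothetical per-element clock driven by freeze response messages."""

    def __init__(self, controller_ids, time_source):
        self.controller_ids = set(controller_ids)
        if time_source not in self.controller_ids:
            raise ValueError("time_source must be one of the controllers")
        self.time_source = time_source   # the "particular controller"
        self.system_time = 0
        self._updates = {}               # controller id -> time update

    def on_freeze_response(self, controller_id, time_update):
        """Record a time update; apply it once all controllers have responded."""
        self._updates[controller_id] = time_update
        if set(self._updates) != self.controller_ids:
            return False
        # Use only the designated controller's value, regardless of the
        # order in which the responses arrived.
        self.system_time = self._updates[self.time_source]
        self._updates = {}
        return True


# Two elements receive the same responses in different orders.
e1 = ElementClock(["C1", "C2"], time_source="C1")
e2 = ElementClock(["C1", "C2"], time_source="C1")
e1.on_freeze_response("C1", 1000)
e1.on_freeze_response("C2", 1003)
e2.on_freeze_response("C2", 1003)
e2.on_freeze_response("C1", 1000)
```

Because both elements take the value from the same controller, their system times agree even though the controllers' own clocks differ and the messages arrive in different orders.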
Upon receiving a freeze response message from each controller to which a freeze request message was transmitted, a computing element may process data received from a controller prior to receipt of freeze response messages from the controllers.
In another aspect, generally, the invention features handling faults in a computer system including error reporting elements and error processing elements. An error reporting element detects an error condition and transmits information about the error condition as an error message to error processing elements connected to the erro
