Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2007-04-17
2010-06-22
Bonzo, Bryce P (Department: 2113)
active
07743285
ABSTRACT:
One embodiment relates to a high-availability computation apparatus including a chip multiprocessor. Multiple fault zones are configurable in the chip multiprocessor, each fault zone being logically independent from other fault zones. Comparison circuitry is configured to compare outputs from redundant processes run in parallel on the multiple fault zones. Another embodiment relates to a method of operating a high-availability system using a chip multiprocessor. A redundant computation is performed in parallel on multiple fault zones of the chip multiprocessor and outputs from the multiple fault zones are compared. When a miscompare is detected, an error recovery process is performed. Other embodiments, aspects and features are also disclosed.
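The method described in the abstract (run a redundant computation in parallel on multiple fault zones, compare the outputs, and perform error recovery on a miscompare) can be illustrated with a minimal software sketch. This is not the patented hardware design: the names `make_zone`, `outputs_match`, `run_redundant`, and `Miscompare` are hypothetical, threads stand in for fault zones, an equality check stands in for the comparison circuitry, and re-execution is assumed as the recovery process because the abstract does not specify one.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, FrozenSet, Sequence


class Miscompare(Exception):
    """Raised when redundant outputs still disagree after all recovery attempts."""


def make_zone(fail_on_calls: FrozenSet[int] = frozenset()) -> Callable[[int], int]:
    """Build a hypothetical fault zone that squares its input.

    On the call numbers listed in fail_on_calls (1-based), the zone flips
    the low bit of its result to imitate a transient fault.
    """
    calls = 0

    def zone(x: int) -> int:
        nonlocal calls
        calls += 1
        result = x * x
        if calls in fail_on_calls:
            result ^= 1  # injected single-bit error
        return result

    return zone


def outputs_match(outputs: Sequence[int]) -> bool:
    # Software stand-in for the comparison circuitry: every output must be equal.
    return all(out == outputs[0] for out in outputs[1:])


def run_redundant(
    zones: Sequence[Callable[[int], int]], operand: int, max_retries: int = 1
) -> int:
    """Run the same computation on every zone in parallel and compare the outputs.

    Assumption: recovery from a miscompare is a simple re-execution on all zones.
    """
    with ThreadPoolExecutor(max_workers=len(zones)) as pool:
        for _attempt in range(max_retries + 1):
            outputs = list(pool.map(lambda zone: zone(operand), zones))
            if outputs_match(outputs):
                return outputs[0]
            # Miscompare detected: fall through and re-execute.
    raise Miscompare(f"outputs disagree after {max_retries + 1} attempts")
```

In this sketch, a zone built with `make_zone(frozenset({1}))` corrupts only its first result, so `run_redundant` detects the miscompare and succeeds on the retry. A zone that corrupts every attempt exhausts the retries and raises `Miscompare`, which is where a real system would escalate to a stronger recovery action.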
REFERENCES:
patent: 5588111 (1996-12-01), Cutts et al.
patent: 6151684 (2000-11-01), Alexander et al.
patent: 6427163 (2002-07-01), Arendt et al.
patent: 7308605 (2007-12-01), Jardine et al.
patent: 7412479 (2008-08-01), Arendt et al.
patent: 2002/0152420 (2002-10-01), Chaudhry et al.
patent: 2005/0240806 (2005-10-01), Bruckert et al.
patent: 2007/0022348 (2007-01-01), Racunas et al.
patent: 2007/0260939 (2007-11-01), Kammann et al.
patent: 2007/0282967 (2007-12-01), Fineberg et al.
B. T. Gold, J. C. Smolens, B. Falsafi, and J. C. Hoe, “The Granularity of Soft-Error Containment in Shared Memory Multiprocessors”, 2006, Proceedings of the Workshop on Silicon Errors in Logic—System Effects (SELSE).
D. J. Sorin et al. “SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery”, Jun. 2002, In Proc. of 29th Intl. Symp. on Comp. Arch. (ISCA-29).
M. Prvulovic et al. “ReVive: cost-effective architectural support for rollback recovery in shared memory multiprocessors”, Jun. 2002, In Proc. of 29th Intl. Symp. on Comp. Arch. (ISCA-29).
T. M. Austin, “DIVA: A reliable substrate for deep submicron microarchitecture design”, Nov. 1999, In Proc. of the 32nd Intl. Symp. on Microarchitecture.
M. K. Qureshi et al. “Microarchitecture-based introspection: A technique for transient-fault tolerance in microprocessors”, Jun. 2005, In Proc. of 32nd Intl. Symp. on Comp. Arch. (ISCA-32).
J. Ray et al. “Dual use of superscalar datapath for transient-fault detection and recovery”, Dec. 2001, In Proceedings of the 34th International Symposium on Microarchitecture.
J. C. Smolens et al. “Fingerprinting: Bounding soft-error detection latency and bandwidth”, Oct. 2004, pp. 224-234, In Proc. of Eleventh Intl. Conf. on Arch. Support for Program. Lang. and Op. Syst. (ASPLOS XI), Boston, Massachusetts.
S. K. Reinhardt and S. S. Mukherjee “Transient fault detection via simultaneous multithreading”, Jun. 2000, In Proceedings of the 27th International Symposium on Computer Architecture.
E. Rotenberg “AR-SMT: A microarchitectural approach to fault tolerance in microprocessors”, Jun. 1999, In Proceedings of the 29th International Symposium on Fault-Tolerant Computing.
T. N. Vijaykumar et al. “Transient-fault recovery using simultaneous multithreading”, May 2002, In Proceedings of the 29th International Symposium on Computer Architecture.
M. Gomaa et al. “Transient-fault recovery for chip multiprocessors”, Jun. 2003, In Proceedings of the 30th International Symposium on Computer Architecture.
S. S. Mukherjee et al. “Detailed design and evaluation of redundant multithreading alternatives”, May 2002, pp. 99-110, In Proceedings of the 29th International Symposium on Computer Architecture.
K. Sundaramoorthy et al. “Slipstream processors: Improving both performance and fault tolerance”, Oct. 2000, In ASPLOS.
W. Bartlett and B. Ball, “Tandem's Approach to Fault Tolerance,” Feb. 1988, pp. 84-95, Tandem Systems Rev., vol. 4, No. 1.
B. T. Gold et al. “TRUSS: a reliable, scalable server architecture”, Nov.-Dec. 2005, IEEE Micro.
Russ Joseph, “Exploring Salvage Techniques for Multi-core Architectures”, 2005, Workshop on High Performance Computing Reliability Issues.
F. Bower et al. “Tolerating hard faults in microprocessor array structures”, 2004, In proceedings of the International Conference on Dependable Systems and Networks.
Premkishore Shivakumar, Stephen W. Keckler, Charles R. Moore, and Doug Burger, “Exploiting Microarchitectural Redundancy For Defect Tolerance”, Oct. 2003, The 21st International Conference on Computer Design (ICCD).
Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, Jude A. Rivers “Exploiting Structural Duplication for Lifetime Reliability Enhancement”, Jun. 2005, The Proceedings of the 32nd International Symposium on Computer Architecture (ISCA'05).
Bernick, D., Bruckert, B., Vigna, P. D., Garcia, D., Jardine, R., Klecka, J., Smullen, J., “NonStop® Advanced Architecture”, 2005, pp. 12-21, Proceedings of the International Conference on Dependable Systems and Networks (DSN'05).
T.J. Siegel, et al., “IBM's S/390 G5 Microprocessor Design” Mar./Apr. 1999, pp. 12-23, IEEE Micro, vol. 19, No. 2.
Aggarwal Nidhi
Jouppi Norman P.
Ranganathan Parthasarathy
Bonzo Bryce P
Hewlett-Packard Development Company, L.P.
TITLE:
Chip multiprocessor with configurable fault isolation