Process for reconfiguring an information processing system...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C714S003000

Reexamination Certificate

active

06789214

ABSTRACT:

FIELD OF THE INVENTION
The invention relates to a “warm” process for dynamically reconfiguring an information processing system upon detection of a failure of at least one component.
The invention applies more particularly to multiprocessor systems, and even more particularly to symmetric multiprocessor systems of the “SMP” type.
BACKGROUND OF THE INVENTION
Within the context of the invention, the term “component” should be considered in its most general sense. It includes hardware components, for example a processor in a system of the aforementioned “SMP” type, as well as software components, for example one of the modules of the operating system.
The term “failure” should also be understood in its most general sense. Naturally, it can refer to malfunctioning components. However, within the context of the invention, it most often refers to one or more components having a high risk of actually malfunctioning.
These are generally situations in which a parameter associated with the operation of one of the components of the system moves above or below a predefined critical threshold. To illustrate the concept, let us take the example of the detection and correction of errors in data recorded in a processor's random access memory. It is normal to record redundant data in addition to the data actually used, and to use an error correction mechanism known by the abbreviation “ECC” (for “Error Correcting Code”). This mechanism uses the redundant data to correct the errors detected. However, one may decide, as a preventive measure, to consider all or part of this random access memory to be malfunctioning if the detected error rate passes a predetermined threshold, and to stop using this memory or memory part.
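As a rough illustration of this preventive criterion, the following Python sketch counts corrected errors in a sliding time window and reports a memory region as at risk once the count passes a threshold. The class name, the window length and the threshold value are assumptions made for the example; the text does not specify how the error rate is measured.

```python
from collections import deque


class CorrectedErrorMonitor:
    """Hypothetical monitor: treats a memory region as 'failed' (at risk)
    when too many corrected errors occur within a time window."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors          # assumed critical threshold
        self.window_seconds = window_seconds  # assumed observation window
        self._timestamps = deque()

    def _drop_old(self, now):
        # Forget corrected errors that fall outside the observation window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()

    def record_corrected_error(self, now):
        # Called each time the correction mechanism repairs an error.
        self._timestamps.append(now)
        self._drop_old(now)

    def is_at_risk(self, now):
        # True once the number of recent corrected errors passes the threshold.
        self._drop_old(now)
        return len(self._timestamps) > self.max_errors
```

A region flagged in this way is still working (every error was corrected), which is precisely the sense of “failure” used here: a high estimated risk rather than an actual malfunction.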
It is easy to see that the “shutdown” of a component, no matter which one, poses a certain number of problems, and cannot be done without certain precautions. This component essentially constitutes what is known as a computing resource (hardware or software) of the information processing system. First of all, if it is currently running (for example processing a task, if it is a processor) or being used by other components of the system, it cannot be shut down immediately. It is necessary for the current operation to finish, or at least be taken over by other components. If not, serious and irrecoverable errors (data losses, etc.) can occur. It is then necessary to isolate the malfunctioning component from the other parts of the system. Finally, it is most often necessary to reconfigure the system while waiting for the defective component to be repaired and/or replaced.
All these operations, especially the last two, require a shutdown of all or part of the system.
For certain applications using what are called high-availability systems, a halt in operation is unacceptable.
Even outside the field of these specific applications, current requirements tend toward this need for high availability.
From the point of view of the user or the client, the concept of high availability often constitutes a major factor in the choice of a system. It is now common to expect continuous operation, i.e., 24 hours a day, 7 days a week.
Hardware is becoming more and more complex and specialized, accordingly requiring higher reliability in order to obtain at least the same availability and life cycle as in the past, given that current requirements tend to increase. These requirements are most critical for the very high-capacity systems known as “mainframes.”
It follows that new maintenance procedures need to be used: all maintenance interventions must henceforth be planned. The object is to disturb the normal operation of the system as little as possible. In fact, any unplanned maintenance generates extremely high costs and/or losses. These losses and costs naturally have a substantial impact on the user or client's confidence.
More particularly, there is a specific problem relative to the hardware components. Components that are likely to malfunction must be replaced, which generally requires a shutdown of the machine. But the machine, in keeping with the aforementioned requirements, must continue to operate; the two constraints are clearly incompatible.
Consequently, in the prior art, the proposed solutions have tended to favor the aforementioned high availability. For this reason, so-called “hardware” solutions are used. These solutions generally involve a specific configuration, and may require equally specific software drivers.
These solutions can be categorized as follows:
a connection during operation called a “hot plug”: this involves a subassembly of specialized circuits that makes it possible to isolate, disconnect and replace the so-called malfunctioning components;
a machine architecture called a “cluster,” in which several machines work together: if one of the machines malfunctions, another machine belonging to the set replaces it, taking advantage of the fact that the data are redundant and distributed among these machines; and
a (so-called “multipath”) hardware redundancy, in which two or more components work in parallel: if one of the components malfunctions, the other parallel components take over.
This last variant ensures a “two-out-of-three” redundancy.
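The “two-out-of-three” arrangement can be pictured as a majority vote over the outputs of components working in parallel. The Python sketch below is only an illustration, with a hypothetical function name; it is not taken from any of the prior-art systems discussed.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of parallel components.

    Raises ValueError when no value is shared by more than half of them.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among component outputs")
    return value
```

With three components, one faulty output is outvoted by the other two, which is why every component has to be present in triplicate.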
All of these solutions have drawbacks:
a connection during operation of the “hot plug” type requires the existence of specialized circuits, which in turn generates increased complexity in the system and a corresponding increase in the cost of production, and the method requires special support from the operating system and cannot be applied to existing systems after the fact;
by nature, a “cluster” of machines is constituted by at least two complete machines: the hardware redundancy is therefore substantial and the need to use a specific operating system for this type of system architecture results in a very high cost; and
hardware redundancy is, a priori, the most expensive of the techniques, since it involves multiplying all of the components constituting the system, or at least the majority of them.
SUMMARY OF THE INVENTION
The object of the invention is to provide a process that meets the needs that have arisen while eliminating the drawbacks of the processes and devices of the prior art, some of which have just been mentioned.
According to the process of the invention, it is no longer specifically necessary to provide some type of hardware redundancy.
To this end, according to a first important characteristic, the process according to the invention comprises steps that, upon detection of a component failure, temporarily “freeze” the system in a way that may be called “warm” or “on the fly,” i.e., halt, in an orderly fashion, all of the activities in progress at a so-called coherent break point in the system's operation. This “freeze” relies on a mutual exclusion mechanism.
The term “failure” should be understood in the sense indicated above, i.e., generally an estimated failure risk.
Once this state is obtained, a reconfiguration of the system is then performed. In essence, it becomes possible to perform a dynamic reallocation or a de-allocation of any resource of the system, whether hardware or software: processor, memory, software module (including a part of the operating system), or any other component.
To do this, according to another important characteristic of the invention, specific processes or tasks are implemented, which are run in a non-preemptive and synchronized manner, under the control of one of the so-called master processes or of a component serving as such.
The malfunctioning component is then isolated from the system.
Finally, the system's operating system is released and the system can once again run normally.
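The sequence just described (freeze at a coherent break point, reconfigure, isolate, release) can be sketched in Python with ordinary threads. This is a minimal illustration under stated assumptions, not the mechanism claimed by the invention: FreezeController, break_point, run_demo and the “cpu” names are hypothetical, worker threads stand in for the system's activities, a condition variable stands in for the mutual exclusion mechanism, and a single master is assumed.

```python
import threading
import time


class FreezeController:
    """Hypothetical coordinator: parks workers at break points while a
    single master thread reconfigures shared resources."""

    def __init__(self, worker_count):
        self._cond = threading.Condition()
        self._workers = worker_count   # workers still running
        self._parked = 0               # workers waiting in the current freeze
        self._frozen = False
        self._generation = 0           # incremented at every release

    def break_point(self):
        """Called by a worker between two units of work (a coherent point)."""
        with self._cond:
            if not self._frozen:
                return
            generation = self._generation
            self._parked += 1
            self._cond.notify_all()          # let the master recount
            while self._generation == generation:
                self._cond.wait()            # stay parked until released

    def worker_done(self):
        """Called by a worker that exits, so the master stops waiting for it."""
        with self._cond:
            self._workers -= 1
            self._cond.notify_all()

    def reconfigure(self, action):
        """Master side: freeze, wait for all workers, act, then release."""
        with self._cond:
            self._frozen = True
            while self._parked < self._workers:
                self._cond.wait()
            try:
                action()                     # e.g. isolate the failing component
            finally:
                self._frozen = False
                self._parked = 0
                self._generation += 1
                self._cond.notify_all()


def run_demo():
    """Two workers run; the master removes the hypothetical 'cpu1' while frozen."""
    active = {"cpu0", "cpu1", "cpu2"}
    parked_during_action = []
    controller = FreezeController(worker_count=2)
    stop = threading.Event()

    def worker():
        while not stop.is_set():
            time.sleep(0.001)                # one unit of work
            controller.break_point()
        controller.worker_done()

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for thread in threads:
        thread.start()

    def isolate():
        parked_during_action.append(controller._parked)
        active.discard("cpu1")

    controller.reconfigure(isolate)
    stop.set()
    for thread in threads:
        thread.join(timeout=5)
    return active, parked_during_action, [t.is_alive() for t in threads]
```

In this sketch each worker finishes its current unit of work before parking, so the reconfiguration only runs once every activity has reached a break point; the workers then resume without being restarted.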
It must be clearly understood that all of the preceding steps are transparent for the operating system, for the user (system administrator, etc.) and also for the user applications. It has also been indicated that the ma
