Method and apparatus for tolerating power outages of...

Electrical computers and digital processing systems: support – Computer power control – Power conservation

Reexamination Certificate

Details

C713S330000, C713S340000, C714S014000, C714S024000

active

06195754

ABSTRACT:

This invention relates generally to fault-tolerant multi-processor systems. In particular, this invention relates to methods for improving the resilience of a multi-processor system in power instability and power failure scenarios.
BACKGROUND OF THE INVENTION
Pre-existing systems provide the feature of tolerating power outages ranging in duration from small fractions of a second to hours. For the shortest outages, ranging up to tens of milliseconds, the tolerance has been absolute. Totally transparent operation has been provided.
Longer outages are not transparent. No service is provided during the power outage, but recovery (resumption of service) after the outage is relatively fast (typically less than one minute) due to preservation of full memory state and transparent resumption of all processes executing at the beginning of the power outage. This type of tolerance might be thought of as “hibernation” during the outage. Typically, this feature tolerates outages up to approximately two hours.
FIG. 1 illustrates multiple processor subsystems 110a, 110b, . . . , 110n composing a pre-existing multi-processor system 100. Each processor subsystem 110 includes two power supplies, IPS 120 and UPS 130; and a lost-memory detection circuit (not shown). Each processor subsystem 110 also includes its respective processor logic 140, including a memory 180 and associated memory control logic 190; a maintenance diagnostic processor (MDP) 150; I/O controllers 160; and IPC controllers 170.
The interruptible power supply (IPS) 120 supplies power to the processor logic 140 (excluding the memory 180 and some of the memory control logic 190 but including the cache 1H0, if any), the MDP 150, and the I/O and inter-processor subsystem communications (IPC) controllers 160, 170. The uninterruptible power supply (UPS) supplies power to the memory 180 and some of the memory control logic 190.
The UPS 130 typically includes a battery, such as the battery 1A0. During normal operation, an alternating current (AC) power source (not shown) drives both the IPS and the UPS 120, 130 and charges the battery supply 1A0. Should the AC power source fail, the battery power supply 1A0 supplies power to the UPS 130, thus enabling the UPS 130 to maintain the contents of the memory 180 valid during the power outage.
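The split between the two supplies described above can be summarized in a small lookup. This is only an illustrative sketch: the names `Rail`, `POWER_RAIL`, `survives_outage`, and `lost_on_outage` are hypothetical, and because the text says only that "some of" the memory control logic 190 is UPS-fed, the sketch models just that unspecified UPS-fed portion.

```python
from enum import Enum


class Rail(Enum):
    IPS = "interruptible"
    UPS = "uninterruptible"


# Rail assignment as described in the text. Which part of the memory
# control logic 190 sits on the UPS is not specified, so it is modeled
# here as a single "UPS-fed part" entry (an assumption).
POWER_RAIL = {
    "processor logic 140": Rail.IPS,
    "cache 1H0": Rail.IPS,
    "MDP 150": Rail.IPS,
    "I/O controllers 160": Rail.IPS,
    "IPC controllers 170": Rail.IPS,
    "memory 180": Rail.UPS,
    "memory control logic 190 (UPS-fed part)": Rail.UPS,
}


def survives_outage(component, battery_ok=True):
    """True if the component keeps power, and so its state, through an AC outage."""
    return POWER_RAIL[component] is Rail.UPS and battery_ok


def lost_on_outage():
    """Components whose volatile state must be saved to memory before power is lost."""
    return sorted(c for c in POWER_RAIL if not survives_outage(c))
```

Anything returned by `lost_on_outage()` holds state that is gone once the IPS drops, which is why the cache and processor registers have to be written to the memory 180 while the warning interval lasts.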
When the power supply control circuitry 1I0 detects a loss of AC power, it asserts a signal 1C0 herein termed the power-failure warning signal. This signal 1C0 connects to the interrupt logic of its respective processor system 110 so that software notices the loss of AC power via an interrupt.
The capacitance design of the power supply guarantees that the power-failure warning signal 1C0 occurs at least a predetermined amount of time (5 milliseconds, in one embodiment) before power from the IPS 120 becomes unreliable. The power supply control circuitry 1I0 switches the UPS 130 over to the battery supply 1A0 and shuts down the IPS 120 when the IPS 120 becomes unreliable.
The predetermined time guarantee allows the software to do two things before power is lost. First, the software recognizes the interrupt (even though there may be times when the power-failure warning signal interrupt is masked off, resulting in some delay in recognizing the interrupt). Second, the software saves state as described in more detail below.
Processor subsystems 110 with no cache or with write-through caches use a first predetermined guaranteed time. However, on processor subsystems 110 with write-back caches, the time necessary to save cache to the memory 180 can be substantial. An alternative, larger predetermined guaranteed time is calculated by estimating the worst-case time necessary to save every line of the largest cache.
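The worst-case estimate described above amounts to simple arithmetic. The sketch below is a hedged illustration, not the patent's own formula: the function names are hypothetical, the cache size, line size, and per-line write cost in the usage note are made-up figures, and adding the flush time on top of the base guarantee is an assumption (the text says only that the larger time is derived from the worst-case flush estimate). Times are kept in integer nanoseconds to avoid rounding.

```python
def worst_case_flush_time_ns(cache_bytes, line_bytes, ns_per_line):
    """Time to write back every line of the cache, assuming all lines are dirty."""
    lines = -(-cache_bytes // line_bytes)  # ceiling division
    return lines * ns_per_line


def required_warning_time_ns(base_ns, cache_bytes, line_bytes, ns_per_line):
    """Base guarantee plus the worst-case flush (the sum is an assumption)."""
    return base_ns + worst_case_flush_time_ns(cache_bytes, line_bytes, ns_per_line)
```

With a hypothetical 1 MiB write-back cache, 64-byte lines, and 200 ns per line, the flush alone needs 16384 lines x 200 ns = 3,276,800 ns (about 3.3 ms), so a 5 ms base guarantee would grow to roughly 8.3 ms under these assumed figures.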
When AC power returns, the power control circuitry resumes IPS-based operation, starts charging the battery supply 1A0 again, and asserts a power-on signal. This signal causes the MDP 150 to reset and bootstrap itself and then to control the resetting of the processor 140.
The lost-memory detection circuit contains a flip-flop (not shown) to determine whether memory contents are valid after a power outage. The power-supply circuitry explicitly sets (e.g., to logical TRUE) the flip-flop whenever power from the UPS 130 is restored. The processor subsystem 110 clears the flip-flop (e.g., sets it to logical FALSE) during power-on processing, after saving its value into a reset control word. This flip-flop retains its value as long as power from the UPS 130 is valid.
Boot code receives the reset control word when the processor 140 is reset. The boot code uses this information to decide whether to initiate automatic power-on recovery (when memory contents are valid) or to wait in a loop for instructions (when memory contents were lost).
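The flip-flop and boot-code decision above can be modeled in a few lines. This is a sketch under stated assumptions: the class and function names are hypothetical, the reset control word is reduced to a single `memory_lost` field, and a TRUE flip-flop is read as "UPS power was interrupted, so memory contents were lost", which the text implies but does not state outright.

```python
class LostMemoryFlipFlop:
    """Set by the power-supply circuitry when UPS power is restored; holds its
    value for as long as UPS power stays valid."""

    def __init__(self):
        # The very first application of UPS power also counts as a restore.
        self.value = True

    def ups_power_restored(self):
        self.value = True

    def read_and_clear(self):
        value = self.value
        self.value = False
        return value


def power_on(flip_flop):
    """Power-on processing: save the flip-flop into the reset control word, then clear it."""
    return {"memory_lost": flip_flop.read_and_clear()}


def boot_action(reset_control_word):
    """Boot-code decision based on the reset control word."""
    if reset_control_word["memory_lost"]:
        return "wait_for_instructions"
    return "automatic_power_on_recovery"
```

In this model, a power-on that follows an outage during which the battery held up finds the flip-flop still cleared and proceeds to automatic recovery; a power-on after `ups_power_restored()` finds it set and waits for instructions.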
The power-failure warning signal 1C0 (if not masked) raises a software interrupt, and the software begins executing a power-fail interrupt handler. The interrupt handler immediately stops all I/O activity. (This early action is necessary on systems without DMA I/O capability because the handling of reconnects could cause the state-saving steps described below to proceed too slowly, resulting in a failure to recover from a power outage.)
The main function of the power-fail interrupt handler, however, is to save such processor state as is necessary for resumption of operations after the power outage ends. While all processors (of known design) would save their working (general purpose) registers, different types of processors 140 save different state. For example, translation lookaside buffer (TLB) entries and I/O Control (IOC) entries both exist in volatile processor state. Processors 140 with TLB or IOC entries save such state to memory before power is lost.
After saving the necessary state, the power-fail interrupt handler sets a state-saved variable in system global memory to logical TRUE. This variable is initialized to FALSE at cold load or reload time and is also set to FALSE on a power-on event.
Next, the interrupt handler executes a power-fail shout mechanism, described below.
Finally, the interrupt handler executes the code responsible for somewhat gracefully stopping all I/O and IPC traffic and flushing dirty cache contents (if any) to main memory. For example, in the IPC case, both the sending and receiving DMA engines are instructed to finish handling the current packet and then stop operation. The completion status is saved for later use.
(When the network services return to normal operating mode, if the DMA engine was in operation when the power down was performed, then the saved status of that last operation is examined. If that completion was normal, then the DMA engine is restarted with any queued operations. If that completion was an error termination, then the normal error recovery for that operation is performed (except that notification of the client may be deferred because interrupts may be disabled). At the next opportunity for I/O interrupts, the aborted non-inter-processor-subsystem-communications transfers are delivered to network services clients.)
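The resume logic in the parenthetical above can be sketched as a small decision function. All names here are hypothetical, the status is simplified to a string, and returning no actions when the DMA engine was idle at power-down is an assumption, since the text does not cover that case.

```python
def resume_dma(was_active, saved_status, queued_ops, interrupts_enabled):
    """Return the actions taken for one DMA engine when normal operation resumes."""
    if not was_active:
        # Assumption: an engine that was idle at power-down needs no special handling.
        return []
    if saved_status == "normal":
        # Last packet completed cleanly: restart with whatever is queued.
        return [("restart_engine", list(queued_ops))]
    # Error termination: run the normal error recovery for that operation.
    # Client notification may be deferred while interrupts are disabled.
    notify = "now" if interrupts_enabled else "deferred"
    return [("run_error_recovery", saved_status), ("notify_client", notify)]
```

The saved completion status is what lets the engine tell a packet that finished just before the outage from one that was cut short.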
On systems with write-back caches, dirty cache lines are saved to memory because the IPS 120 supplies the cache with power, and thus the cache contents are not preserved during the outage.
Control then transfers to the software that signals the hardware to fence the external (I/O bus and IPC path) drivers so that garbage is not driven onto these busses when power becomes unreliable.
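The order of operations in the power-fail path, as described in the preceding paragraphs, can be laid out as a simulation. This is a minimal sketch, not firmware: the `Subsystem` class, the step names, and the log are hypothetical stand-ins for hardware actions, and the step order simply follows the order in which the text introduces them.

```python
class Subsystem:
    """Hypothetical stand-in for one processor subsystem 110."""

    def __init__(self, has_write_back_cache):
        self.has_write_back_cache = has_write_back_cache
        self.state_saved = False  # FALSE at cold load, reload, and power-on
        self.log = []

    def step(self, name):
        self.log.append(name)


def power_fail_interrupt_handler(subsystem):
    """Run the power-fail sequence and return the ordered list of steps taken."""
    subsystem.step("stop_io_activity")           # done first so state saving is not delayed
    subsystem.step("save_processor_state")       # registers, plus TLB/IOC entries if present
    subsystem.state_saved = True                 # state-saved variable in system global memory
    subsystem.step("power_fail_shout")
    subsystem.step("quiesce_io_and_ipc")         # finish the current packet, save completion status
    if subsystem.has_write_back_cache:
        subsystem.step("flush_dirty_cache_lines")  # the cache is IPS-powered and will be lost
    subsystem.step("fence_external_drivers")     # keep garbage off the I/O bus and IPC path
    return subsystem.log
```

Calling `power_fail_interrupt_handler(Subsystem(has_write_back_cache=True))` yields all six steps in order; without a write-back cache the flush step is skipped. Everything here has to fit inside the guaranteed warning interval.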
At this point, the software waits for one of two things to happen. One possibility is that this power outage is either very short or a brown-out. In this case, IPS po
