Electrical computers and digital processing systems: processing – Processing control – Generating next microinstruction address
Reexamination Certificate
2000-04-20
2004-02-03
Tsai, Henry W. H. (Department: 2183)
Electrical computers and digital processing systems: processing
Processing control
Generating next microinstruction address
C712S235000, C712S240000
Reexamination Certificate
active
06687812
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a parallel processing apparatus, and, more particularly, to a parallel processing apparatus which processes programs in parallel while generating and terminating a thread consisting of a plurality of instructions in a plurality of processors.
2. Description of the Related Art
Today's typical computers are of a von Neumann-type whose built-in processor that plays the key role in each computer repeats a sequence of procedures of fetching a single instruction, decoding it, executing a process specified by that instruction, accessing the memory and writing the execution result back in the memory.
To improve the processing speed, the current computers each have a cache memory with a fast access speed provided between the main memory and the processor. The processor therefore mainly exchanges data with the cache memory. The operation of the processor of reading an instruction from the cache memory is called “instruction fetching”, and the operation of decoding an instruction is called “instruction decoding”, and the operation of writing the execution result back into the memory is called “write back”.
Pipelining is known as one of techniques that improve the processing speed of processors. The “pipelining” process is described in many books about computers, for example, “Computer Architecture” by Hennessy and Patterson. Pipelining is the technique that improves the processing performance by allowing a plurality of instructions, each of which performs only a part of the entire process, to be executed in an overlapped manner in one clock cycle.
FIG. 13
is a diagram for explaining a pipelining process.
An instruction is executed in separate pipeline stages called “instruction fetching (IF)”, “instruction decoding (ID)”, “instruction execution (EX)”, “memory access (MEM)” and “write back (WB)”. In cycle T
1
, an instruction at an address “1000” undergoes instruction fetching. In cycle T
2
, the instruction at the address “1000” undergoes instruction decoding and an instruction at an address “1004” undergoes instruction fetching at the same time. This technique of simultaneously executing a plurality of instructions in an overlapped manner is called “pipelining”. Registers placed between processes are called “pipeline registers” and a process unit for carrying out each process is called a “pipeline stage”. As is apparent from the above, pipelining speeds up the processing as a whole by executing instructions, described in a program, in parallel.
However, there occurs a circumstance where an instruction cannot be executed in the proper cycle due to a change in the program flow caused by a branching instruction. While a scheme of computing the address of the branching destination specified by the branching instruction at an early stage in the pipeline stages, such as the ID stage, is taken for faster processing, the branching destination cannot be determined for a conditional branching instruction until the condition is determined. For a conditional branching instruction, therefore, the cycle that stops pipelining is eliminated by carrying out a scheme of predicting whether or not its branching condition is satisfied by using history information (see pp. 302 to 307 in the aforementioned book entitled “Computer Architecture” by Hennessy and Patterson).
A “superscalar” system (“Superscalar” by Johnson) which improves the processing speed by providing a plurality of processing elements or processor elements in a single processor and simultaneously generating a plurality of instructions has already been put to a practical use. The superscalar system is ideally capable of executing instructions equal in number to the provided processor elements in one clock. It is however said that even if the number of processor elements should be increased limitlessly, instructions would not be smoothly executed due to a branching instruction and the actual performance would be restricted to about three to four times that of the case of using a single processor.
Another practical way of improving the processing speed is to perform parallel processing by using a plurality of processors. In a typical processor system which accomplishes parallel processing by using a plurality of processors, parallel processing is executed by carrying out communication among the processors to assign processes to the individual processors. A system which uses conventional processors accomplishes such communication by an interruption processing scheme that is from outside carried out externally each processor as an interrupt control on that processor.
In the interruption processing scheme, when an external unit interrupts a processor, a program to be executed in the processor is switched to an interruption program from a user program and the interruption process is then executed. When the interruption process is completed, the original user program is resumed. To switch the execution program in a processor, data which will be used again by the original user program, such as data in the program counter or register file, is saved in a memory device. The overhead that is need for such data saving for switching between programs is nonnegligibly large and an interruption process is generally takes time. A parallel processing system which uses interruption processing therefore suffers a large overhead in communications between processors, which is an impediment in enhancing the performance.
One solution to this problem is a so-called multi-thread architecture. This technique is disclosed in, for example, “A Multi-threaded Massively Parallel Architecture”, Proceedings of 19th International Symposium on Computer Architecture, R. S. Nikhil, G. M. Papadopuolos, and Arvind, pp. 156-167.
A “thread” is a sequence of instructions. A program consists of a plurality of threads. In a multi-thread architecture, thread-by-thread processes are assigned to a plurality of processors so that those processors can process threads in parallel. Therefore, the multi-thread architecture has a mechanism and an instruction for allowing a thread which is being executed on one processor to generate a new thread on another processor.
The generation of a new thread on another processor is called “to fork a thread” and an instruction to fork a thread is called a “fork instruction”. A fork instruction specifies to which processor element a thread should be forked and which thread to fork.
Control parallel processing has been proposed in, for example, “Proposition Of On Chip MUlti-Stream Control Architecture (MUSCAT)” by Torii et al., Joint Symposium Parallel Processing JSPP '97, pp. 229-236. The multi-stream control architecture analyzes the control flow of a program, predicts a path which is very likely to be executed soon, and speculatively executes the path before its execution is established. In this manner, the multi-stream control processes programs in parallel.
FIG. 14
is a diagram showing a model of multi-stream control.
A conventional sequence of instructions which are executed sequentially consists of threads A, B, and C. In the sequential execution, one processor processes the threads A, B, and C in order as shown in section (a) in FIG.
14
. In the multi-stream control, by way of contrast, while a processor element (PE)#
0
is processing the thread A, the thread B which is expected to be executed later is forked to and is processed by a processor element #
1
as shown in section (b) in FIG.
14
. The processor element #
1
forks the thread C to a processor element #
2
. The speculative execution of threads which are expected to be executed later can ensure parallel processing of threads, thus improving the processing performance.
The aforementioned paper that has proposed the “MUSCAT” mentions that it is not always possible to predict, before execution, whether or not a thread is to be forked. It is also known that adequate parallel processing cannot be achieved merely by the established forking that involves threads whose forking has been e
NEC Corporation
Tsai Henry W. H.
LandOfFree
Parallel processing apparatus does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Parallel processing apparatus, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Parallel processing apparatus will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3351119