Processor array and parallel data processing methods

Electrical computers and digital processing systems: processing – Processing architecture – Array processor

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C712S017000, C712S021000, C712S027000, C712S201000, C712S225000, C709S238000, C711S106000

Reexamination Certificate

active

06728862

ABSTRACT:

TECHNICAL FIELD
This invention relates to computers. In particular the invention relates to massively parallel computers having processor arrays and methods for using arrays of processors to solve problems. Specific embodiments of the invention are particularly useful for image processing.
BACKGROUND
Image processing is both computationally intensive and data intensive. By way of example, using an MPEG (“Motion Picture Experts Group”) image compression algorithm to compress a 20 Megabytes-per-second television signal in real time may require on the order of 200 billion arithmetic operations per second. The goal of providing cost effective computer systems capable of providing the extremely high throughput required for image processing and similar tasks has so far eluded the computer industry.
One way to achieve higher throughput in computer image processing systems is to use a higher speed processor. The processor could be any of several types commonly in use, such as RISC (reduced instruction set computer), CISC (complex instruction set computer), DSP (digital signal processor), or VLIW (very long instruction word). A basic problem with applying a high speed processor to data intensive applications such as image processing is that the processor typically spends a significant amount of time moving data to and from the memory. Further, when a single processor is used, the inherently parallel nature of many image processing algorithms must be broken down by the programmer into a serial program which works with one or at most a few pixels at a time.
Another common approach to achieving real-time performance in difficult image processing applications is to build custom hardware to perform the image processing. To do so, a problem is typically broken down into its main functional steps, and each step is implemented by different hardware sub systems. The hardware may be provided on an application specific integrated circuit (ASIC) or the like. Such hardware-based solutions do not typically scale up very well to larger image sizes, nor are they readily applicable to other problems.
A further way to achieve higher throughput is to divide the image processing task between many processor elements (PEs). For inherently two-dimensional (2D) problems, such as image processing, which deal with 2-dimensional arrays of data elements, such as pixels, it is natural to arrange a number of processing elements so that each processing element is logically arranged at a node of a 2-dimensional grid. Local connections are provided between neighbouring processors. A natural way to implement many 2D problems is to assign a single processor element to each data element. That is, to provide processor elements arranged at nodes of a mesh which has the same dimensions as the array of data elements that it manipulates. There are many examples of the use of computer processor arrays for solving image processing and other computational problems.
An architecture that assigns only a few data elements per processor element is termed “fine-grained”. In contrast, a coarse grained architecture has many data elements assigned to each processor element. M. J. Flynn Very High Speed Computing Systems, Proceedings of the IEEE, Vol. 54, No. 12, pp. 1901-1909 (1966) categorized parallel processing computing systems into three categories: SIMD (single instruction stream, multiple data streams), MIMD (multiple instruction streams, multiple data streams) and MISD (multiple instruction streams, single data stream). In a SIMD system, the same instruction is broadcast to all processor elements. Each processor element has its own set of registers along with some means for it to receive unique data (such as a data value for a particular pixel in an image). In SIMD systems each individual processor element can be simple because it does not require a separate program counter or logic for fetching instructions from memory. Consequently, SIMD arrays can be well suited for fine-grained architectures.
In MIMD architectures every processor element has its own program store and can operate independently of other processor elements. A MIMD processor array may also be termed a “multi-computer”, because each processor element is full computer in its own right. MIMD architectures are not as well suited to fine-grained problems such as image processing because each processor element in a MIMD array is more complicated than, and requires larger circuits than, its counterpart in a SIMD array. Further, inter-processor contention for shared resources is an issue because the processor elements in a MIMD array operate independently.
In MISD architectures a single stream of data is passed along a chain of processors with a different operation performed at each step in the chain. Systems which implement MISD architectures are more commonly referred to as systolic arrays, and are well suited to signal processing and video scan line processing, but not well suited to problems such as image compression that require two-dimensional operations.
In a SIMD array it is difficult to implement algorithms where one group of processor elements is required to operate differently from another group of processor elements. In some SIMD architectures individual processor elements can conditionally skip instructions (SIMD architectures without this capability can achieve the effect of condition statements through more complicated mathematical expressions).
Models for studying and modelling parallel computing have been proposed in which there are multiple instruction streams each of which is provided to a specific set of processing elements and multiple data streams. Such models are termed MSIMD models. Typically each instruction stream is associated with a specific data stream.
A key problem with using any parallel array of processors is to program the processors in the array in such a way that the parallelism is well utilized (i.e. so that a good proportion of the processors are kept busy most of the time). As a simple example, consider the following conditional branch structure, coded in the C programming language. Such a conditional sequence might occur where the behaviour of some processor elements (e.g. processor elements processing pixels which are located at the boundary of an image) needs to be different from all other processor elements.
if (r
0
==0)
{
/* Sequence A for non-boundary pixels*/
. . .
}
else
{
/* Sequence B for boundary pixels*/
. . .
}
In this example, r
0
is the symbolic name for a register in each processor element. The processor element executes either sequence A or sequence B depending on the state of its r
0
register. It can be appreciated that if sequence A and sequence B are equally long then each processor element will be utilized only 50% of the time because it will have to skip one or other of the conditional branches. The processor elements all receive the same instruction stream. While a processor element is skipping instructions it is not performing useful work.
A table lookup operation is another example of inefficient utilization of a parallel array. Consider a table lookup operation wherein each processor element is required to retrieve an element from a table based on the contents of a register. Table lookup operations of this type are used commonly, for example, to implement such tasks as colour correction, contrast enhancement, or texture mapping. Typically the table is much larger than the memory available at each processor element. Even if there were sufficient data storage at each processor element it would be a poor use of memory resources to have a copy of the same table in the memory of every processor element. Since each processor element requires access to a specific element of the table either the table will be stored in an external memory the entire table must be broadcast to every processor element. If the table is stored in an external memory then there will be contention problems caused by a large number of processor elements attempting simultaneously to access the table. If th

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Processor array and parallel data processing methods does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Processor array and parallel data processing methods, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Processor array and parallel data processing methods will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3192091

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.