Optimization of instruction stream execution that includes a...

Electrical computers and digital processing systems: processing – Processing architecture – Long instruction word

Reexamination Certificate

Details

C712S218000, C712S219000, C712S228000

active

06425069

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to a superscalar processor with very long instruction word (VLIW)-like dispatch groups and more particularly to the decode and treatment of instructions which access volatile address space.
BACKGROUND OF THE INVENTION
Superscalar processors employ aggressive techniques to exploit instruction-level parallelism. Wide dispatch and issue paths place an upper bound on peak instruction throughput. Large issue buffers are used to maintain a window of instructions necessary for detecting parallelism, and a large pool of physical registers provides destinations for all of the in-flight instructions issued from the window beyond the dispatch boundary. To enable concurrent execution of instructions, the execution engine is composed of many parallel functional units. The fetch engine speculates past multiple branches to supply a continuous instruction stream to the decode, dispatch and execution pipelines, and so maintain a large window of potentially executable instructions.
The trend in superscalar design is to scale these techniques: wider dispatch/issue, larger windows, more physical registers, more functional units, and deeper speculation. To maintain this trend, it is important to balance all parts of the processor, because any bottleneck diminishes the benefit of aggressive techniques. Data-dependent decode, however, very adversely affects the performance gains of these techniques.
FIG. 1 illustrates a block diagram of a typical processing system. The processing system includes a processor 301 and a cache 302 which communicates with a host bus 300. The host bus also communicates with a memory controller 303 which in turn provides and receives information from the system memory 304. The memory controller 303 in turn communicates with another bus, in this example a PCI bus 100. The PCI bus communicates with an IDE controller which in turn is connected to a hard disk drive 111. Also the PCI bus communicates with a video adapter 102 which is in turn coupled to a CRT 112. PCI bus 100 also is coupled to an ISA bus through a PCI/ISA interface 103. The ISA bus 200 in turn is coupled to an Ethernet or Token Ring controller which is coupled to a network or a local area network (LAN). It communicates with another video adapter 202 which has its associated CRT 212 and an IDE controller 201 which is coupled to a hard disk drive 211.
One of the critical bottlenecks in such a processing system is load and store bandwidth. This is particularly true for machines which operate at higher frequencies, because of the growing disparity among processor, I/O bus, and main memory operating frequencies. Since most processor architectures which are currently prevalent (x86 (IA-32), PowerPC/AS, ARM, etc.) were implemented before this memory/logic frequency disparity became so pronounced, many contain some type of volatile I/O space or strongly ordered memory in one or more of their respective system architectures.
This can simply be defined as address space which, if accessed multiple times, will respond with different data. An example of this would be a memory-mapped FIFO in a video or communications adapter, or a multiplicity of addresses which, if accessed in a different order, will respond with different data.
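The difference between volatile address space and ordinary memory can be illustrated with a small software model. This is only a sketch: the class names `MemoryMappedFifo` and `OrdinaryMemoryCell` and the data values are hypothetical and do not come from the patent. It shows that two reads of the same FIFO location return different data, while two reads of an ordinary cell do not.

```python
from collections import deque


class MemoryMappedFifo:
    """Hypothetical model of a FIFO data register in an adapter."""

    def __init__(self, data):
        self._queue = deque(data)

    def read(self):
        # Reading has a side effect: the head of the queue is consumed,
        # so repeating the access yields the next datum, not the same one.
        return self._queue.popleft()


class OrdinaryMemoryCell:
    """Hypothetical model of a normal, non-volatile memory location."""

    def __init__(self, value):
        self._value = value

    def read(self):
        # Reading has no side effect; repeating the access is harmless.
        return self._value


fifo = MemoryMappedFifo([0x11, 0x22, 0x33])
first_read = fifo.read()
second_read = fifo.read()

cell = OrdinaryMemoryCell(0x11)
cell_reads = (cell.read(), cell.read())
```

A speculative or replayed load to the FIFO would therefore silently discard a datum, which is why such accesses cannot simply be retried.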
The requirement that this be supported has a devastating effect on processor implementations and performance because it requires the physical or effective address (depending on the architecture) to be compared against some table, range register, or other checking mechanism to determine if the address can be accessed out-of-order. This is further compounded by attempts at adding wider dispatch groups, which optimally can be done in a VLIW-like dispatch group which has no ability to maintain ordering within the dispatch group. Since the actual address is not known at instruction decode time, a processor which implements such VLIW-like dispatch groups must block execution, flush the VLIW-like dispatch group, and reformat the individual instructions of the VLIW-like word into individual instructions forming a safe and lower performance sequence.
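The address check described above can be sketched as a comparison against a set of range registers. The names `RangeRegister`, `may_access_out_of_order`, and `effective_address`, the 32-bit address width, and the example address range are all assumptions made for illustration; the patent does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeRegister:
    """Hypothetical range register marking a volatile address window."""
    base: int
    limit: int  # exclusive upper bound

    def contains(self, address):
        return self.base <= address < self.limit


def effective_address(base_value, displacement):
    # Assumes a 32-bit address space; the sum wraps around at 2**32.
    return (base_value + displacement) & 0xFFFF_FFFF


def may_access_out_of_order(address, volatile_ranges):
    # An address may be accessed out of order only if no range register
    # marks it as volatile.
    return not any(r.contains(address) for r in volatile_ranges)


# Hypothetical adapter window; the addresses are illustrative only.
VOLATILE_RANGES = [RangeRegister(0xF000_0000, 0xF000_1000)]

adapter_ea = effective_address(0xF000_0000, 0x10)
memory_ea = effective_address(0x0010_0000, 0x10)
adapter_ok = may_access_out_of_order(adapter_ea, VOLATILE_RANGES)
memory_ok = may_access_out_of_order(memory_ea, VOLATILE_RANGES)
```

The check needs the computed address, which depends on a register value. That is why it cannot be performed at decode time, when only the register number and displacement are known.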
In a very high-frequency processor which has a deep pipeline this has an unacceptably high performance penalty for any code stream which might even occasionally access this type of storage.
This problem manifests itself in a processor supporting the PowerPC/AS architecture. Additionally, all addresses within the particular guarded range must be accessed in program order. Guarded is defined in this application as an address which must only be accessed once for each datum. There is no way to distinguish between guarded storage for different adapters or devices, so all accesses to guarded space must be performed in strict program order.
Direct storage is different from guarded because a single memory address can be accessed multiple times without changing its value, but the order of accesses must be maintained. The present invention optimizes the performance of this strict architectural requirement in a VLIW-like processor.
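The two constraints can be made concrete with a small trace checker. This is a sketch under stated assumptions: the function name `check_trace`, the trace format, and the choice to track ordering separately for guarded and direct accesses are illustrative and not taken from the patent. Each trace entry is a pair of the access's program-order position and its storage kind, listed in the order the accesses actually reached storage.

```python
def check_trace(trace):
    """Check a list of (program_position, kind) pairs, given in execution order.

    kind is "guarded", "direct", or "normal" (hypothetical labels).
    Returns a short status string.
    """
    last_position = {"guarded": -1, "direct": -1}
    seen_guarded = set()
    for position, kind in trace:
        if kind == "normal":
            continue  # no ordering or repetition constraint modeled here
        if kind == "guarded":
            # A guarded datum must be accessed only once.
            if position in seen_guarded:
                return "guarded access repeated"
            seen_guarded.add(position)
        # Both guarded and direct accesses must stay in program order.
        # A repeated direct access (same position) is allowed.
        if position < last_position[kind]:
            return "order violated"
        last_position[kind] = position
    return "ok"


in_order = check_trace([(0, "guarded"), (1, "normal"), (2, "guarded")])
reordered = check_trace([(2, "guarded"), (0, "guarded")])
replayed_guarded = check_trace([(0, "guarded"), (0, "guarded")])
replayed_direct = check_trace([(0, "direct"), (0, "direct"), (1, "direct")])
```

In this model a replay after a flush is tolerated for direct storage but not for guarded storage, while reordering is rejected for both.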
SUMMARY OF THE INVENTION
A method and system for optimizing execution of an instruction stream which includes a very long instruction word (VLIW) dispatch group in which ordering is not maintained is disclosed. The method and system comprises examining an access which initiated a flush operation; capturing an indice related to the flush operation; and causing all storage access instructions related to this indice to be dispatched as single IOP groups until the indice is updated.
Storage accesses to address space which is not safe for speculative access, such as Guarded (G=1) or Direct Store (E=DS) space, must be handled in a non-speculative manner so that operations which could potentially go to volatile I/O devices or control locations do not get processed out of order. Since the address is not known in the front end of the processor, this can only be determined by the load store unit or the functional block which performs translation. Therefore, if a flush occurs for these conditions, in accordance with the present invention the value of the base register (RA) is latched, and subsequent loads and stores which use this base register are decoded in a “safe” manner until an instruction is decoded which would change the base register value. Here, safe means an internal instruction sequence which can be executed in order without repeating any accesses. The value of multiple base registers can be tracked in this manner, though the preferred embodiment would not use more than two: one of the base registers could be for input streams and one for output streams.
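The decode-time behavior described above can be sketched in software as follows. This is a minimal model, not the patented implementation: the names `Instruction` and `SafeDecodeTracker`, the `"single"` and `"grouped"` labels, the decision to track the base register number rather than its latched value, and the oldest-first replacement when more than two registers are flushed are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instruction:
    opcode: str               # "load", "store", or any other operation
    base_reg: Optional[int]   # RA used to form the address, if any
    dest_reg: Optional[int]   # register written by the instruction, if any


class SafeDecodeTracker:
    """Hypothetical front-end tracker for base registers that caused a flush."""

    def __init__(self, max_tracked=2):
        self.max_tracked = max_tracked
        self.tracked = []  # base register numbers latched after a flush

    def on_flush(self, base_reg):
        # Called when the load/store unit reports that an access through
        # base_reg hit guarded or direct-store space and forced a flush.
        if base_reg in self.tracked:
            return
        if len(self.tracked) == self.max_tracked:
            self.tracked.pop(0)  # assumption: drop the oldest entry
        self.tracked.append(base_reg)

    def decode(self, insn):
        is_storage_access = insn.opcode in ("load", "store")
        safe = is_storage_access and insn.base_reg in self.tracked
        # An instruction that writes a tracked base register ends tracking
        # for the instructions that follow it.
        if insn.dest_reg in self.tracked:
            self.tracked.remove(insn.dest_reg)
        return "single" if safe else "grouped"


tracker = SafeDecodeTracker()
tracker.on_flush(3)  # an access through r3 caused a flush
stream = [
    Instruction("load", 3, 5),      # uses tracked r3: dispatched alone
    Instruction("store", 4, None),  # untracked base: normal dispatch group
    Instruction("add", None, 3),    # overwrites r3: tracking ends
    Instruction("load", 3, 6),      # r3 no longer tracked
]
modes = [tracker.decode(insn) for insn in stream]
```

In this model only storage accesses through a tracked base register pay the single-instruction dispatch cost, so code that never touches the flagged address stream keeps its full dispatch width.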

