Electrical computers and digital processing systems: multicomput – Computer-to-computer data routing – Least weight routing
Reexamination Certificate
1999-10-01
2002-12-10
Kim, Kenneth S. (Department: 2183)
Electrical computers and digital processing systems: multicomput
Computer-to-computer data routing
Least weight routing
C709S241000, C710S200000, C711S151000, C712S228000
Reexamination Certificate
active
06493741
ABSTRACT:
BACKGROUND OF THE INVENTION
A “thread” is a stream of instructions being executed by a processor. Software that is multithreaded has multiple threads of control that cooperate to perform a task.
A simultaneous multithreaded (SMT) central processor unit (CPU) provides, on a single CPU, the capability of executing instructions from multiple threads simultaneously.
On a simultaneously-multithreaded processor, the hardware provides facilities for executing multiple threads as if each thread were executing on its own CPU. This abstract thread processor is called a thread processing unit, or TPU. To the outside world, a TPU has all the capabilities of a conventional CPU. It holds a full process context while a process or thread is executing on that TPU. The term “processor” is used herein to refer either to a TPU or a conventional CPU.
For example, a 4-way issue CPU might have two functional units executing instructions from one thread, while the other two functional units are executing instructions from an unrelated thread. This is accomplished by providing enough registers and other process-specific resources on the CPU to support as many threads as can run simultaneously, and then choosing among the threads to determine which specific instructions will be executed. The threads may be related, where they are cooperatively doing work, or they may be entirely unrelated.
FIG. 1
compares sample execution sequences for superscalar, multithreading, and simultaneous multithreading architectures. Each row represents the issue slots for a single execution cycle: a filled box indicates that the processor found an instruction to execute in that issue slot on that cycle. An empty box denotes an unused slot. The unused slots can be characterized as horizontal or vertical waste. Horizontal waste occurs when some, but not all, of the issue slots in a cycle can be used. It typically occurs because of poor instruction-level parallelism. Vertical waste occurs when a cycle goes completely unused. This can be caused by a long latency instruction, such as a memory access, that inhibits further instruction issue.
Sequence (a)
2
corresponds to a conventional superscalar. As in all superscalars, it is executing a single program, or thread, from which it attempts to find multiple instructions to issue each cycle. When it cannot, the issue slots go unused, and both horizontal
3
A and vertical waste
3
B are incurred.
Sequence (b)
4
corresponds to a multithreaded architecture. Multithreaded processors contain hardware state, i.e., a program counter and registers, for several threads. On any given cycle, a processor executes instructions from just one of the threads. On the next cycle, it switches to a different thread context and executes instructions from the new thread. The primary advantage of multithreaded processors is that they better tolerate long-latency operations, effectively eliminating vertical waste. However, they cannot removed horizontal waste. Consequently, as instruction issue width continues to increase, multithreaded architectures will ultimately suffer the same fate as superscalars: they will be limited by the instruction-level parallelism in a single thread.
Sequence (c)
6
corresponds to a simultaneous multithreaded architecture and shows how each cycle in an SMT processor selects instructions for execution from all threads. It exploits instruction-level parallelism by selecting instructions from any thread that can potentially issue. The processor then dynamically schedules machine resources among the instructions, providing the greatest chance for the highest hardware utilization. If one thread has high instruction-level parallelism, that parallelism can be satisfied; if multiple threads each have low instruction-level parallelism, they can be executed simultaneously to compensate. In this way, SMT can recover issue slots lost to both horizontal and vertical waste.
Simultaneous multithreading is advantageous because it allows the CPU to get better throughput. Resources which would lie idle due to limited parallelism in one thread can be utilized by other threads.
A software program can be compiled, or decomposed, into multiple threads, with the purpose of achieving improved performance through parallel execution of those threads. The threads may be executed on different processors in a multiprocessor, or they may be executed on different thread processing units within an SMT CPU.
When programs are multithreaded in this way, locking protocols are used to control access to shared data. Assigning a special memory location, called a lock, to a section of data, controls access to that section of data. A thread can only update the data when it owns the lock.
An integral part of many locking protocols is a busy wait loop, often referred to as a “spin lock.” In a spin lock, a process loops, looking at a particular memory location, i.e., the lock, and waiting for it to change to a specific value before proceeding. Once the value has changed, the process is then free to attempt to obtain the lock via an atomic update of the location
SUMMARY
In a conventional multiprocessor, the CPU resources and memory bandwidth consumed by a task in a spin lock are not simultaneously shared with any other tasks. Thus, while the task is spinning there is no resource contention within the CPU, and no reason not to let the task spin. Various studies have shown that approximately 15% of processor time is spent in spin loops.
In a simultaneous multithreaded CPU, however, the resources consumed by the spinning task are being denied to the other threads that are or could be doing useful work. In fact, Applicants have found that under these circumstances, the simultaneously multithreaded CPU provided no performance increase to the decomposed application, and can actually degrade the performance of the application.
One multithreaded computer uses fine-grained multithreading, which is different from SMT, and addresses the synchronization problem with a hardware retry which traps the thread after some number of failures and deschedules it. This is described in “Exploiting Heterogeneous Parallelism on a Multithreaded Multiprocessor,” 1992, which can be found at web site tera.com/www/archives/library/psdocs.html.
U.S. patent application Ser. No. 08/775,553 by Emer et al, “A Multi-threaded Processor and Method That Selects Threads Based On An Attribute,” (name amended in February, 1999), filed Dec. 31, 1996, assigned to a common assignee as the present invention and incorporated by reference herein in its entirety, describes an SMT architecture.
Many papers have been published about Simultaneous Multithreading. For a fairly complete list, see web site cs.washington.edu/research/smt/. The University of Washington has done much work on efficient synchronization on SMT. See, for example, “Supporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor,” 1995, available at web site cs.washington.edu/research/smt/papers/hpca.ps. A longer version of the paper, UCSD CSE Technical Report #CS98-587, is available at web site cs.washington.edu/research/smt/papers/smt.synch.ps.
These papers propose a synchronization “lock-box” mechanism which has the primary goal of providing faster synchronization between threads. The lock itself is memory-based, but once the lock is obtained by a thread on a particular CPU, the lock-box passes the lock among the threads on that CPU, if they require it. If a thread fails to acquire a lock and must wait for it to become available, the thread's instructions are flushed from the pipeline to prevent that thread consuming resources on the CPU. The mechanisms of synchronizing and possibly flushing the instructions are combined into one “Acquire” instruction, and all actions required by that instruction are carried out strictly in hardware.
U.S. Pat. No. 5,524,247 is a software patent on scheduling to avoid locks. It does not involve hardware and it is not related to SMT architecture.
The present invention resolves the problem of spin-lock in an SMT architecture by haltin
Edwards Bruce E.
Emer Joel S.
Fossum Tryggve
Hicks, Jr. James E.
Joerg Christopher F.
Compaq Information Technologies Group L.P.
Hamilton Brook Smith & Reynolds P.C.
Kim Kenneth S.
LandOfFree
Method and apparatus to quiesce a portion of a simultaneous... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and apparatus to quiesce a portion of a simultaneous..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus to quiesce a portion of a simultaneous... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2985597