Electrical computers: arithmetic processing and calculating – Electrical digital calculating computer – Particular function performed
Reexamination Certificate
2000-07-10
2004-12-14
Mai, Tan V. (Department: 2124)
Electrical computers: arithmetic processing and calculating
Electrical digital calculating computer
Particular function performed
C708S402000
Reexamination Certificate
active
06832232
ABSTRACT:
BACKGROUND
1. Field of the Invention
The present invention relates generally to systems and methods for performing discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) operations. The invention also relates to digital video compression and decompression, and more particularly to a video encoder and decoder for performing the discrete cosine transform and/or inverse discrete cosine transform with improved efficiency and reduced computational requirements.
2. Description of the Related Art
DSP theory provides a host of tools for the analysis and representation of signal data. The discrete cosine transform and its inverse are among the more ubiquitous of these tools in multimedia applications. The discrete cosine transform (DCT) of a discrete function f(j), j=0, 1, . . . N-1 is defined as 
F
⁡
(
k
)
=
2
⁢
c
⁡
(
k
)
N
⁢
∑
j
=
0
N
-
1
⁢
 
⁢
f
⁡
(
j
)
·
cos
⁡
[
(
2
⁢
j
+
1
)
⁢
k
⁢
 
⁢
π
2
⁢
N
]
,
where k=0,1, . . . , N−1, and 
c
⁡
(
k
)
=
{
1
2
for
⁢
 
⁢
k
=
0
1
for
⁢
 
⁢
k
≠
0
}
.
The inverse discrete cosine transform (IDCT) is defined by 
f
⁡
(
j
)
=
∑
k
=
0
N
-
1
⁢
 
⁢
c
⁡
(
k
)
⁢
F
⁡
(
k
)
⁢
cos
⁡
[
(
2
⁢
j
+
1
)
⁢
k
⁢
 
⁢
π
2
⁢
N
]
,
where j=0, 1, . . . N−1.
The discrete cosine transform may be used in a wide variety of applications and allows an arbitrary input array size. However, the straightforward DCT algorithm is often prohibitively time-consuming especially when executed on general purpose processors. In 1977, Chen et al. disclosed an efficient algorithm for performing the DCT in an article entitled “A Fast Computational Algorithm for the Discrete Cosine Transform”, published in IEEE Transactions on Communications, Vol. COM-25, No. 9, September, 1977, authored by Wen-Hsiung Chen, C. Harrison Smith and S. C. Fralick, which is hereby incorporated by reference. Fast DCT algorithms such as that disclosed by Chen et al. are significantly more efficient that the straightforward DCT algorithm. Nevertheless, there remains room for improvement, particularly when the algorithm is employed in specific circumstances.
Traditional ×86 processors are not well adapted for the types of calculations used in signal processing. Thus, signal processing software applications on traditional ×86 processors have lagged behind what was realizable on other processor architectures. There have been various attempts to improve the signal processing performance of ×86-based systems. For example, microcontrollers optimized for digital signal processing computations (DSPs) have been provided on plug-in cards or the motherboard. These microcontrollers operated essentially as hardwired coprocessors enabling the system to perform signal processing functions.
As multimedia applications become more sophisticated, the demands placed on computers are redoubled. Microprocessors are now routinely provided with enhanced support for these applications. For example, many processors now support single-instruction multiple-data (SIMD) commands such as MX instructions. Advanced Micro Devices, Inc. (hereinafter referred to as AMD) has proposed and implemented 3DNow!™, a set of floating point SIMD instructions on ×86 processors starting with the AMD-K6®-2. The AMD-K6®-2 is highly optimized to execute the 3DNow!™ instructions with minimum latency. Software applications written for execution on the AMD-K6®-2 may use these instructions to accomplish signal processing functions and the traditional ×86 instructions to accomplish other desired functions.
The 3DNow! instructions, being SIMD commands, are “vectored” instructions in which a single operation is performed on multiple data operands. Such instructions are very efficient for graphics and audio applications where simple operations are repeated on each sample in a stream of data. SIMD commands invoke parallel execution in superscalar microprocessors where pipelining and/or multiple execution units are provided.
Vectored instructions typically have operands that are partitioned into separate sections, each of which is independently operated upon. For example, a vectored multiply instruction may operate upon a pair of 32-bit operands, each of which is partitioned into two 16-bit sections or four 8-bit sections. Upon execution of a vectored multiply instruction, corresponding sections of each operand are independently multiplied. 
FIG. 1
 illustrates the differences between a scalar (i.e., non-vectored) multiplication and a vector multiplication. To quickly execute vectored multiply instructions, microprocessors such as the AMD-K6®-2 use a number of multipliers in parallel.
FIG. 2
 illustrates one embodiment of a representative computer system 
100
 such as the AMD-K6®-2 which is configured to support the execution of general-purpose instructions and parallel floating-point instructions. Computer system 
100
 may comprise a microprocessor 
110
, memory 
112
, bus bridge 
114
, peripheral bus 
116
, and a plurality of peripheral devices P1-PN. Bus bridge 
114
 couples to microprocessor 
10
, memory 
112
 and peripheral bus 
116
. Bus bridge 
114
 mediates the exchange of data between microprocessor 
110
, memory 
112
 and peripheral devices P1-PN.
Microprocessor 
110
 is a superscalar microprocessor configured to execute instructions if a variable length instruction set. A subset of the variable length instruction set is the set of SIMD (single-instruction multiple-data) floating-point instructions. Microprocessor 
110
 is optimized to execute the SIMD floating-point instructions in a single clock cycle. In addition, the variable length instruction set includes a set of ×86 instructions (e.g. the instructions defined by the 80486 processor architecture).
Memory 
112
 stores program instructions which control the operation of microprocessor 
110
. Memory 
112
 additionally stores input data to be operated on by microprocessor 
110
, and output data generated by microprocessor 
110
, in response to the program instructions. Peripheral devices P
1
-PN are representative of devices such as network interface cards (e.g. Ethernet cards), modems, sound cards, video acquisition boards, data acquisition cards, external storage media, etc. Computer system 
100
 may be a personal computer, a laptop computer, a portable computer, a television, a radio receiver and/or transmitter, etc.
FIG. 3
 illustrates one embodiment for microprocessor 
110
. Microprocessor 
110
 may be configured with 3DNow!™ and MMX® technologies. Microprocessor 
110
 may comprise bus interface unit 
224
, predecode unit 
212
, instruction cache 
214
, decode unit 
220
, execution engine 
230
, and data cache 
226
. Microprocessor 
110
 may also include store queue 
238
 and an L
2
 cache 
240
. Additionally, microprocessor 
110
 may include a branch prediction unit and a branch resolution unit (not shown) to allow efficient speculative execution.
Predecode unit 
212
 may be coupled to instruction cache 
214
, which stores instructions received from memory 
112
 via bus interface unit 
224
 and predecode unit 
212
. Instruction cache 
214
 may also contain a predecode cache (not shown) for storing predecode information. Decode unit 
220
 may receive instructions and predecode information from instruction cache 
214
 and decode the instructions into component pieces. The component pieces may be forwarded to execution engine 
230
. The component pieces may be RISC operands. (Microprocessor 
110
 may be RISC-based superscalar microprocessor). RISC ops are fixed-format internal instructions, most of which are executable by microprocessor 
10
 in a single clock cycle. RISC operations may be combined to form every function of the ×86 instruction set.
Execution engine 
230
 may execute the decoded instructions in response to the component pieces received from decode unit 
220
. As shown in 
FIG. 4
, execution engine 
230
 may include a scheduler buff
Gorishek Frank J.
Hus Wei-Lien
Liu Yi
Kivlin B. Noäl
Meyertons Hood Kivlin Kowert & Goetzel P.C.
LandOfFree
Dual-block inverse discrete cosine transform method does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Dual-block inverse discrete cosine transform method, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Dual-block inverse discrete cosine transform method will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3312124