Electrical computers: arithmetic processing and calculating – Electrical digital calculating computer – Particular function performed
Reexamination Certificate
2000-07-10
2004-12-14
Mai, Tan V. (Department: 2124)
Electrical computers: arithmetic processing and calculating
Electrical digital calculating computer
Particular function performed
C708S402000
Reexamination Certificate
active
06832232
ABSTRACT:
BACKGROUND
1. Field of the Invention
The present invention relates generally to systems and methods for performing discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) operations. The invention also relates to digital video compression and decompression, and more particularly to a video encoder and decoder for performing the discrete cosine transform and/or inverse discrete cosine transform with improved efficiency and reduced computational requirements.
2. Description of the Related Art
DSP theory provides a host of tools for the analysis and representation of signal data. The discrete cosine transform and its inverse are among the more ubiquitous of these tools in multimedia applications. The discrete cosine transform (DCT) of a discrete function f(j), j=0, 1, . . . N-1 is defined as
F
⁡
(
k
)
=
2
⁢
c
⁡
(
k
)
N
⁢
∑
j
=
0
N
-
1
⁢
⁢
f
⁡
(
j
)
·
cos
⁡
[
(
2
⁢
j
+
1
)
⁢
k
⁢
⁢
π
2
⁢
N
]
,
where k=0,1, . . . , N−1, and
c
⁡
(
k
)
=
{
1
2
for
⁢
⁢
k
=
0
1
for
⁢
⁢
k
≠
0
}
.
The inverse discrete cosine transform (IDCT) is defined by
f
⁡
(
j
)
=
∑
k
=
0
N
-
1
⁢
⁢
c
⁡
(
k
)
⁢
F
⁡
(
k
)
⁢
cos
⁡
[
(
2
⁢
j
+
1
)
⁢
k
⁢
⁢
π
2
⁢
N
]
,
where j=0, 1, . . . N−1.
The discrete cosine transform may be used in a wide variety of applications and allows an arbitrary input array size. However, the straightforward DCT algorithm is often prohibitively time-consuming especially when executed on general purpose processors. In 1977, Chen et al. disclosed an efficient algorithm for performing the DCT in an article entitled “A Fast Computational Algorithm for the Discrete Cosine Transform”, published in IEEE Transactions on Communications, Vol. COM-25, No. 9, September, 1977, authored by Wen-Hsiung Chen, C. Harrison Smith and S. C. Fralick, which is hereby incorporated by reference. Fast DCT algorithms such as that disclosed by Chen et al. are significantly more efficient that the straightforward DCT algorithm. Nevertheless, there remains room for improvement, particularly when the algorithm is employed in specific circumstances.
Traditional ×86 processors are not well adapted for the types of calculations used in signal processing. Thus, signal processing software applications on traditional ×86 processors have lagged behind what was realizable on other processor architectures. There have been various attempts to improve the signal processing performance of ×86-based systems. For example, microcontrollers optimized for digital signal processing computations (DSPs) have been provided on plug-in cards or the motherboard. These microcontrollers operated essentially as hardwired coprocessors enabling the system to perform signal processing functions.
As multimedia applications become more sophisticated, the demands placed on computers are redoubled. Microprocessors are now routinely provided with enhanced support for these applications. For example, many processors now support single-instruction multiple-data (SIMD) commands such as MX instructions. Advanced Micro Devices, Inc. (hereinafter referred to as AMD) has proposed and implemented 3DNow!™, a set of floating point SIMD instructions on ×86 processors starting with the AMD-K6®-2. The AMD-K6®-2 is highly optimized to execute the 3DNow!™ instructions with minimum latency. Software applications written for execution on the AMD-K6®-2 may use these instructions to accomplish signal processing functions and the traditional ×86 instructions to accomplish other desired functions.
The 3DNow! instructions, being SIMD commands, are “vectored” instructions in which a single operation is performed on multiple data operands. Such instructions are very efficient for graphics and audio applications where simple operations are repeated on each sample in a stream of data. SIMD commands invoke parallel execution in superscalar microprocessors where pipelining and/or multiple execution units are provided.
Vectored instructions typically have operands that are partitioned into separate sections, each of which is independently operated upon. For example, a vectored multiply instruction may operate upon a pair of 32-bit operands, each of which is partitioned into two 16-bit sections or four 8-bit sections. Upon execution of a vectored multiply instruction, corresponding sections of each operand are independently multiplied.
FIG. 1
illustrates the differences between a scalar (i.e., non-vectored) multiplication and a vector multiplication. To quickly execute vectored multiply instructions, microprocessors such as the AMD-K6®-2 use a number of multipliers in parallel.
FIG. 2
illustrates one embodiment of a representative computer system
100
such as the AMD-K6®-2 which is configured to support the execution of general-purpose instructions and parallel floating-point instructions. Computer system
100
may comprise a microprocessor
110
, memory
112
, bus bridge
114
, peripheral bus
116
, and a plurality of peripheral devices P1-PN. Bus bridge
114
couples to microprocessor
10
, memory
112
and peripheral bus
116
. Bus bridge
114
mediates the exchange of data between microprocessor
110
, memory
112
and peripheral devices P1-PN.
Microprocessor
110
is a superscalar microprocessor configured to execute instructions if a variable length instruction set. A subset of the variable length instruction set is the set of SIMD (single-instruction multiple-data) floating-point instructions. Microprocessor
110
is optimized to execute the SIMD floating-point instructions in a single clock cycle. In addition, the variable length instruction set includes a set of ×86 instructions (e.g. the instructions defined by the 80486 processor architecture).
Memory
112
stores program instructions which control the operation of microprocessor
110
. Memory
112
additionally stores input data to be operated on by microprocessor
110
, and output data generated by microprocessor
110
, in response to the program instructions. Peripheral devices P
1
-PN are representative of devices such as network interface cards (e.g. Ethernet cards), modems, sound cards, video acquisition boards, data acquisition cards, external storage media, etc. Computer system
100
may be a personal computer, a laptop computer, a portable computer, a television, a radio receiver and/or transmitter, etc.
FIG. 3
illustrates one embodiment for microprocessor
110
. Microprocessor
110
may be configured with 3DNow!™ and MMX® technologies. Microprocessor
110
may comprise bus interface unit
224
, predecode unit
212
, instruction cache
214
, decode unit
220
, execution engine
230
, and data cache
226
. Microprocessor
110
may also include store queue
238
and an L
2
cache
240
. Additionally, microprocessor
110
may include a branch prediction unit and a branch resolution unit (not shown) to allow efficient speculative execution.
Predecode unit
212
may be coupled to instruction cache
214
, which stores instructions received from memory
112
via bus interface unit
224
and predecode unit
212
. Instruction cache
214
may also contain a predecode cache (not shown) for storing predecode information. Decode unit
220
may receive instructions and predecode information from instruction cache
214
and decode the instructions into component pieces. The component pieces may be forwarded to execution engine
230
. The component pieces may be RISC operands. (Microprocessor
110
may be RISC-based superscalar microprocessor). RISC ops are fixed-format internal instructions, most of which are executable by microprocessor
10
in a single clock cycle. RISC operations may be combined to form every function of the ×86 instruction set.
Execution engine
230
may execute the decoded instructions in response to the component pieces received from decode unit
220
. As shown in
FIG. 4
, execution engine
230
may include a scheduler buff
Gorishek Frank J.
Hus Wei-Lien
Liu Yi
Kivlin B. Noäl
Meyertons Hood Kivlin Kowert & Goetzel P.C.
LandOfFree
Dual-block inverse discrete cosine transform method does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Dual-block inverse discrete cosine transform method, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Dual-block inverse discrete cosine transform method will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3312124