Electrical computers: arithmetic processing and calculating – Electrical digital calculating computer – Particular function performed
Reexamination Certificate
2000-09-26
2003-01-14
Ngo, Chuong Dinh (Department: 2124)
Electrical computers: arithmetic processing and calculating
Electrical digital calculating computer
Particular function performed
Reexamination Certificate
active
06507860
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to the field of digital signal processing (DSP) in field programmable gate arrays (FPGAs) and more specifically to a method of computing large Fast Fourier Transforms (FFTs) using RADIX-2 elements, through efficient utilization of distributed memory resources.
BACKGROUND OF THE INVENTION
The use of FPGAs for carrying out high speed arithmetic computations has gained recognition in recent years. FPGA architectures including logic blocks having a plurality of look-up-table (LUT) function generators, such as the XC4000™ family of devices from XILINX, Inc. (the assignee of the present invention), are particularly suited for such computations. However, many of the important digital signal processing (DSP) algorithms are multiply-intensive, and even FPGAs having a large number of logic blocks and LUTs normally cannot embed the multiplier circuits and the attendant control and support circuits in a single chip. It is therefore incumbent upon the designer to choose efficient DSP algorithms and to realize them with efficient circuit designs. The Fast Fourier Transform (FFT) is an outstanding example of an efficient DSP algorithm. Distributed arithmetic (DA) is a well-established design approach for DSP implementation in FPGAs that replaces gate-consuming array multipliers with more efficient shift and add circuits offering comparable performance.
The FFT is a highly efficient procedure for computing the Discrete Fourier Transform (DFT) of a sampled time series. The DFT, taken from a continuous waveform, is derived from and closely related to the Fourier transform and is particularly useful for digital power spectrum analysis and filtering. The FFT takes advantage of the fact that the coefficients of the DFT can be calculated iteratively, which results in a considerable savings of computation time and a substantial performance advantage over the DFT.
Distributed Arithmetic (DA) was developed as an efficient computation scheme for DSP utilizing FFTs. The DA computation algorithm is now being effectively applied to embed DSP functions in FPGAs, particularly those with coarse-grained look-up table architectures, as described in U.S. Pat. No. 6,021,423. DA enables the replacement of the array multiplier, central to many DSP applications, with a gate-efficient serial/parallel multiplier, with little or no reduction in speed.
U.S. Pat. No. 6,021,423 discloses a space-efficient DA implementation of a DSP circuit implemented in an FPGA using FFTs. In the disclosed circuit, time-invariant systems are implemented using a 16-word, SCRAM-based DA look-up table (DALUT). The DALUT contains the pre-computed values of all possible sums of coefficients, weighted by binary values of serial input data. Additional RAM resources are required for the large sine/cosine basis function database. These memory requirements are accommodated using a DALUT containing the pre-computed sums of partial products for combinations of input variables X
rm
, X
im
, X
rn
, X
in
and &thgr;
k
, as illustrated in FIG.
1
.
The highly space-efficient implementation of a RADIX-2 circuit, illustrated in FIG.
1
and described in U.S. Pat. No. 6,021,423, allows for the implementation of complex FFT circuitry in a single programmable logic device. While the implementation disclosed in the parent case provides a number of significant advantages over the prior art, there remains a need to increase the speed of circuits that benefit from the use of a plurality of RADIX-2 implementations.
The need for multiple RADIXes is apparent from a time series containing N=2
s
samples or “points” (where s is the number of stages), wherein the corresponding FFT entails 2 sN=2Nlog
2
N multiply operations. For a complete N=1024 point FFT operation on 1024 time-points, a total of 5120=(N/2*log
2
N=512* 10) RADIX-2 operations will be required. If only one RADIX-2 is used, since two cycles are required for each RADIX-2 operation, the total time required will be 10240 (5120* 2) cycles.
To reduce the number of cycles required for FFT calculations, it appears one need only increase the number of RADIX-2 elements in the circuit and use them simultaneously in each stage. However, two cycles (assuming dual-port RAM is used) are also needed to read and write variables to and from memory, and RAM read and write operations are required for every FFT function, even if additional RADIXes are used. Thus, where only a single RAM is available, there is little, if any, gain in implementation speed from the use of more than one RADIX-2 in a stage. A bottleneck in the data-rate from and to the RAM will retard the function of the circuit. Thus, using k RADIXes in a particular stage does not necessarily provide for k-times speedup of FFT calculations over a single-RADIX implementation. There is therefore a need in the art to which the present invention pertains to optimize FFT implementation for simultaneous use of a plurality of RADIXes.
SUMMARY OF THE INVENTION
To address the shortcomings of the available art, the present invention provides a method and system for partitioning RAM resources in such a manner as to optimize memory access bandwidth in a multi RADIX-2 system for FFT calculation. This enables system speed to increase in nearly direct proportion to the increase in processing speed provided by the addition of a plurality of RADIXes to the circuit. In a preferred embodiment, a plurality of memory partitions are provided for RADIXes at early stages in the circuit up to, but not including, a critical stage (pre-critical stages), while only a single RAM partition is required for each RADIX-2 in stages at and beyond the critical stage. In the preferred embodiment, a plurality of memory partitions are accessed by p/2 RADIXes in pre-critical stages of the circuit, while only a single RAM partition is accessed by all of the p RADIXes in stages at and beyond the critical stage. Multiplexing resources are preferably structured to reflect the RAM partition and RADIX interaction for each stage.
It is therefore a first advantage of the present invention to provide a method and system for designing a circuit for providing efficient simultaneous operation of a plurality of RADIX-2 elements to enable a Fast Fourier Transform (FFT) calculation, the circuit being implemented in a logic device including programmable resources for implementing the circuit, the method comprising the steps of (and the system comprising means for) assigning a plurality of memory resources, each. designated R
vw
, to a plurality of RADIX-2 elements, each designated X
ab
, assigning a first multiplexing means to selectively forward data to at least one of the RADIX-2 elements from at least one of the plurality of memory resources, assigning a second multiplexing means to selectively forward data from at least one of the RADIX-2 elements to at least one of the plurality of memory resources, whereby a RADIX-2 X
ab
receives data from memory resources R
vw
such that (vw=ab) or vw can be derived by changing either a or b from 0 to 1, and a memory resource R
vw
receives data from a RADIX-2 X
ab
whose ab is such that (ab=vw) or ab can be derived by changing either v or w from 1 to 0.
It is a further advantage of the present invention to provide a method and system for designing a circuit for providing efficient simultaneous operation of a plurality of RADIX-2 elements, the circuit being implemented in a logic device including programmable resources for implementing the circuit, the circuit processing N time samples at P RADIX-2 elements within S stages, the method comprising the steps of selecting N, P, and S for implementation of the circuit in a first pre-selected programmable logic device, the selection of N, P, and S designating a total area required for the circuit implementation, calculating a calculation time required to perform calculations within the circuit having N samples, P elements and S stages, and modifying either of N, P:, and S to change either of the total area required and
Nag Sudip K.
Verma Hare K.
Cartier Lois D.
Heafy Crosby
Ngo Chuong Dinh
Tachner Adam H.
Xilinx , Inc.
LandOfFree
System and method for RAM-partitioning to exploit... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with System and method for RAM-partitioning to exploit..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for RAM-partitioning to exploit... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3028110