Data processing: speech signal processing – linguistics – language – Speech signal processing – For storage or transmission
Reexamination Certificate
1998-12-30
2001-10-30
{haeck over (S)}mits, T{overscore (a)}livaldis Ivars (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
For storage or transmission
C704S222000, C704S223000
Reexamination Certificate
active
06311154
ABSTRACT:
FIELD OF THE INVENTION
This invention relates generally to digital communications and, in particular, to speech or voice coder (vocoder) and decoder methods and apparatus.
BACKGROUND OF THE INVENTION
One type of voice communications system of interest to the teaching of this invention uses a code division, multiple access (CDMA) technique such as one originally defined by the EIA Interim Standard IS-95A, and in later revisions thereof and enhancements thereto. This CDMA system is based on a digital spread-spectrum technology which transmits multiple, independent user signals across a single 1.25 MHz segment of radio spectrum. In CDMA, each user signal includes a different orthogonal code and a pseudo-random binary sequence that modulates a carrier, spreading the spectrum of the waveform, and thus allowing a large number of user signals to share the same frequency spectrum. The user signals are separated in the receiver with a correlator which allows only the signal energy from the selected orthogonal code to be de-spread. The other users signals, whose codes do not match, are not de-spread and, as such, contribute only to noise and thus represent a self-interference generated by the system. The SNR of the system is determined by the ratio of desired signal power to the sum of the power of all interfering signals, enhanced by the system processing gain or the spread bandwidth to the baseband data rate.
The CDMA system as defined in IS-95A uses a variable rate voice coding algorithm in which the data rate can change dynamically on a 20 millisecond frame by frame basis as a function of the speech pattern (voice activity). The Traffic Channel frames can be transmitted at full, ½, ¼ or ⅛ rate (9600, 4800, 2400 and 1200 bps, respectively). With each lower data rate, the transmitted power (E
s
) is lowered proportionally, thus enabling an increase in the number of user signals in the channel.
Toll quality speech reproduction at low bit rates (e.g., around 4,000 bits per second (4 kb/s) and lower, such as 4, 2 and 0.8 kb/s) has proven to be a difficult task. Despite efforts made by many speech researchers, the quality of speech that is coded at low bit rates is typically not adequate for wireless and network applications. In the conventional CELP algorithm, the excitation is not efficiently generated and the periodicity existing in the residual signal during voiced intervals is not appropriately exploited. Moreover, CELP coders and their derivatives have not shown satisfactory subjective performance at low bit rates.
In a conventional analysis-by-synthesis (“AbS”) coding of speech, the speech waveform is partitioned into a sequence of successive frames. Each frame has a fixed length and is partitioned into an integer number of equal length subframes. The encoder generates an excitation signal by a trial and error search process whereby each candidate excitation for a subframe is applied to a synthesis filter, and the resulting segment of synthesized speech is compared with a corresponding segment of target speech. A measure of distortion is computed and a search mechanism identifies the best (or nearly best) choice of excitation for each subframe among an allowed set of candidates. Since the candidates are sometimes stored as vectors in a codebook, the coding method is referred to as code excited linear prediction (CELP). At other times, the candidates are generated as they are needed for the search by a predetermined generating mechanism. This case includes, in particular, multi-pulse linear predictive coding (MP-LPC) or algebraic code excited linear prediction (ACELP). The bits needed to specify the chosen excitation subframe are part of the package of data that is transmitted to the receiver in each frame.
Usually the excitation is formed in two stages, where a first approximation to the excitation subframe is selected from an adaptive codebook which contains past excitation vectors, and then a modified target signal is formed as the new target for a second AbS search operation which uses the above described procedure.
In Relaxation CELP (RCELP) in the Enhanced Variable Rate Coder (TIA/EIA/IS-127) the input speech signal is modified through a process of time warping to ensure that it conforms to a simplified (linear) pitch contour. The modification is performed as follows.
The speech signal is divided into frames and linear prediction is performed to generate a residual signal. A pitch analysis of the residual signal is then performed, and an integer pitch value, computed once per frame, is transmitted to the decoder. The transmitted pitch value is interpolated to obtain a sample-by-sample estimate of the pitch, defined as the pitch contour. Next, the residual signal is modified at the encoder to generate a modified residual signal, which is perceptually similar to the original residual. In addition, the modified residual signal exhibits a strong correlation between samples separated by one pitch period (as defined by the pitch contour). The modified residual signal is filtered through a synthesis filter derived from the linear prediction coefficients, to obtain the modified speech signal. The modification of the residual signal may be accomplished in a manner described in U.S. Pat. No. 5,704,003.
The standard encoding (search) procedure for RCELP is similar to regular CELP except for two important differences. First, the RCELP adaptive excitation is obtained by time-warping the past encoded excitation signal using the pitch contour. Second, the analysis-by-synthesis objective in RCELP is to obtain the best possible match between the synthetic speech and the modified speech signal.
OBJECTS AND ADVANTAGES OF THE INVENTION
It is a first object and advantage of this invention to provide a method and circuitry that implements a vocoder of the analysis-by-synthesis (AbS) type having adaptively modified subframe boundaries and adaptively determined window sizes and locations within subframes.
It is a second object and advantage of this invention to provide a time-domain real-time speech coding/decoding system based at least in part on a code excited linear prediction (CELP) type algorithm, the speech coding/decoding system using adaptive windows.
It is a further object and advantage of this invention to provide an algorithm and a corresponding apparatus that overcome many of the foregoing problems by employing a novel excitation encoding scheme with a CELP or Relaxation CELP (RCELP) model wherein a pattern classifier is used to determine a classification that best describes the character of the speech signal in each frame, and to then encode the fixed excitation using class-specific structured codebooks.
It is another object and advantage of this invention to provide a method and circuitry that implements a speech coder of the analysis-by-synthesis (AbS) type, wherein the use of adaptive windows enables a relatively limited number of bits to be more efficiently allocated to describe the excitation signal, which results in enhanced speech quality compared to the conventional use of CELP-type coders at bit rates as low as 4 kbps or lower.
SUMMARY OF THE INVENTION
The foregoing and other problems are overcome and the objects and advantages of the invention are realized by methods and apparatus that provide an improved time domain, CELP-type voice coder/decoder.
A presently preferred speech coding model uses a novel class-dependent approach for generating and encoding the fixed codebook excitation. The model preserves the RCELP approach to generate and encode efficiently the adaptive codebook contribution for voiced frames. However, the model introduces different excitation encoding strategies for each of a plurality of residual signal classes, such as voiced, transition, and unvoiced, or for strongly periodic, weakly periodic, erratic (transition), and unvoiced. The model employs a classifier that provides for a closed-loop transition/voiced selection. The fixed-codebook excitation for voiced frames is based on an enhanced adaptive window approach, which is shown to be
Ahmadi Sassan
Cuperman Vladimir
Gersho Allen
Liu Fenghua
Rao Ajit V
Nokia Mobile Phones Limited
Patel Milan I.
Smith Harry
{haeck over (S)}mits T{overscore (a)}livaldis Ivars
LandOfFree
Adaptive windows for analysis-by-synthesis CELP-type speech... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Adaptive windows for analysis-by-synthesis CELP-type speech..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Adaptive windows for analysis-by-synthesis CELP-type speech... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2578996