Method of encoding a speech signal

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate

Details

C704S203000, C704S207000, C704S208000

active

06269332

ABSTRACT:

This invention relates to a method of and apparatus for encoding a speech signal, more particularly, but not exclusively, for encoding speech for low bit rate transmission and storage.
BACKGROUND OF THE INVENTION
In many audio applications it is desired to transfer or store an audio signal digitally, for example a speech signal. Rather than attempting to sample and subsequently reproduce a speech signal directly, a vocoder is often employed which constructs a synthetic speech signal containing the key features of the audio signal, the synthetic signal then being decoded for reproduction.
A coding algorithm that has been proposed for use with a vocoder uses a speech model called the Multi-Band Excitation (MBE) model, first proposed in the paper “Multi-Band Excitation Vocoder” by Griffin and Lim, IEEE Transactions on Acoustics, Speech and Signal Processing Volume 36 No. 8 August 1988 Page 1223. The MBE model divides the speech signal into a plurality of frames which are analyzed independently to produce a set of parameters modelling the speech signal at that frame, the parameters being subsequently encoded for transmission/storage. The speech signal in each frame is divided into a number of frequency bands and for each frequency band a decision is made whether that portion of the spectrum is voiced or unvoiced and then represented by either periodic energy, for a voiced decision, or noise-like energy, for an unvoiced decision. The speech signal in each frame is characterised, using the model, by information comprising the fundamental frequency of the speech signal in the frame, voiced/unvoiced decisions for the frequency bands and the corresponding amplitudes for the harmonics in each band. This information is then transformed and vector quantized to provide the encoder output. The output is decoded by reversing this procedure. A proposal for implementation of a vocoder using the multi-band excitation model may be found in the Inmarsat-M Voice Codec, Version 3, August 1991 SDM/M Mod. 1/Appendix 1 (Digital Voice System Inc.).
It is a problem for implementation of such a vocoder that the fundamental pitch period and the number of harmonics change from frame to frame, since these features are functions of the talker. For example, male speech generally has a lower fundamental frequency with more harmonic components, whereas female speech has a higher fundamental frequency with fewer harmonics. This causes a variable-dimension vector quantization problem. One proposed solution to the problem is to truncate the speech signal by selecting only a predetermined number of harmonics. However, such an approach causes unacceptable speech degradation, particularly when recognition of the speaker of the reconstructed speech signal is desired.
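The dependence of the harmonic count on the talker can be illustrated with a short sketch. The 4 kHz analysis bandwidth (as for 8 kHz sampling), the two example pitch values and the helper name harmonic_count are assumptions made for illustration only; none of them is taken from the text.

```python
import math

def harmonic_count(f0_hz, bandwidth_hz=4000.0):
    """Number of harmonics of f0 that lie at or below the bandwidth.

    The 4 kHz default is an assumed bandwidth, not a value from the text.
    """
    return math.floor(bandwidth_hz / f0_hz)

# Hypothetical pitch values for a lower-pitched and a higher-pitched talker.
for label, f0 in [("lower pitch", 100.0), ("higher pitch", 250.0)]:
    print(label, f0, "Hz ->", harmonic_count(f0), "harmonics")
```

Under these assumptions the amplitude vector has 40 entries for the 100 Hz talker and 16 for the 250 Hz talker, which is why a quantizer designed for one fixed dimension cannot be applied directly.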
A proposal to alleviate this problem is the use of Non-Square Transform (NST) vector-quantization as proposed by Lupini and Cuperman in IEEE Signal Processing Letters, Volume 3, No. 1, January 1996 and Cuperman, Lupini and Bhattacharya in the paper “Spectral Excitation Coding of Speech at 2.4 kb/s” Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing Volume 1. With this approach, the NST transforms the varying number of spectral harmonic amplitudes to a fixed number of transform coefficients which are then vector-quantized.
It is a disadvantage of this proposal, however, that very high computational complexity is involved in the Non-Square Transform operation. This is because the transformation of the varying-dimension vectors into either fixed 30 or 40 dimension vectors of this proposal is highly computationally intensive and requires a large memory to store all the elements of the transform matrices. The recommended fixed dimensional vector requires a one stage quantization which is also computationally expensive. It is a further disadvantage of NST vector quantization that the technique introduces distortion in the speech signal which degrades the perceptual quality of reproduced speech when the size of the codebook of the vector quantizers is small.
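A rough sketch may help to show why a non-square transform is costly to store. The matrix below is built from orthonormal DCT rows and is only a stand-in chosen for illustration; it is not claimed to be the NST construction of the cited papers. The function names, the made-up amplitude values and the assumed range of 9 to 56 harmonics are likewise hypothetical.

```python
import math

def nonsquare_dct_matrix(k_out, n_in):
    """Return a k_out x n_in matrix of orthonormal DCT-II rows.

    Illustrative stand-in only. Rows with index >= n_in are zero, so short
    inputs are zero-padded and long inputs truncated in the transform domain.
    """
    rows = []
    for k in range(k_out):
        if k >= n_in:
            rows.append([0.0] * n_in)
            continue
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_in)
        rows.append([scale * math.cos(math.pi * (2 * n + 1) * k / (2 * n_in))
                     for n in range(n_in)])
    return rows

def transform(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

FIXED_DIM = 40  # one of the two fixed dimensions mentioned in the text

for n_harmonics in (16, 40, 56):
    amplitudes = [1.0 / (h + 1) for h in range(n_harmonics)]  # made-up values
    coeffs = transform(nonsquare_dct_matrix(FIXED_DIM, n_harmonics), amplitudes)
    print(n_harmonics, "harmonics ->", len(coeffs), "coefficients")

# Storage if one matrix is kept per possible harmonic count.
# The range 9..56 is an assumption, not a figure from the text.
stored = sum(FIXED_DIM * n for n in range(9, 57))
print("matrix elements stored:", stored)
```

Every input length yields the same number of coefficients, but a separate matrix is needed for each possible harmonic count, so the stored element count grows with both the fixed dimension and the range of counts supported.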
In some applications it is desired to encode the speech at a low bit rate, for example 2.4 kbps or less. A speech signal encoded in this way requires less memory to store the signal digitally, thus keeping down the cost of a device using that bit rate. However, the use of NST vector quantization, with its consequent requirements of high computational power and memory together with the problem of distortion, does not provide a feasible solution to the problem of low cost encoding and storage of speech at such low bit rates.
It is the object of the invention to provide a method of and apparatus for speech coding which alleviates at least one of the disadvantages of the prior art.
SUMMARY OF THE INVENTION
According to the invention in the first aspect, there is provided a method of encoding a speech signal comprising the steps of:
sampling the speech signal;
dividing the sampled speech signal into a plurality of frames;
performing multi-band excitation analysis on the signal within each frame to derive a fundamental pitch, a plurality of voiced/unvoiced decisions for frequency bands in the signal and amplitudes of harmonics within said bands;
transforming the harmonic amplitudes to form a plurality of transform coefficients;
vector quantizing the coefficients to form a plurality of indices; characterised by
dividing the harmonic amplitudes into a first group of a fixed number of harmonics and a second group of the remainder of the harmonics, the first and second groups being subject to different transforms to form respective first and second sets of transform coefficients for quantization.
Preferably the first transform is a Discrete Cosine Transform (DCT) which transforms the first predetermined number of harmonics into the same number of first transform coefficients. The second transform is preferably a Non-Square Transform (NST), transforming the remainder of the harmonics into a fixed number of second transform coefficients.
Most preferably, the first group comprises the first 8 harmonics of the audio signal, which are transformed into 8 transform coefficients, and the second group comprises the remainder of the harmonics, which are also transformed into 8 transform coefficients.
With the method of the invention, the first group of harmonics is selected to be the most important harmonics for the purpose of recognising the reconstructed speech signal. Since the number of such harmonics is fixed, it is possible to use a fixed dimension transform such as the DCT thus minimising distortion and keeping the dimension of the most important parameters unchanged. On the other hand, the remaining less important harmonics are transformed using the NST variable dimension transform. Since only the less significant harmonics are transformed using the NST, the effect of distortion on reproducibility of the audio signal is minimised.
Furthermore, since the harmonics are split into two groups, the degree of computational power necessary to transform and encode the consequently smaller vectors is less, thus reducing the computational power needed for the encoder.
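The split described in this first aspect might be sketched as follows, using the preferred figures of 8 harmonics and 8 coefficients per group. The function names are hypothetical, the frame is assumed to contain at least 8 harmonics, and the second-group transform is a simple DCT-based stand-in (transform, then truncate or zero-pad) rather than the NST of the cited papers. Vector quantization of the two coefficient sets is not shown.

```python
import math

FIRST_GROUP = 8     # fixed number of harmonics in the first group
SECOND_COEFFS = 8   # fixed number of coefficients for the remainder

def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i, v in enumerate(x)))
    return out

def to_fixed_length(x, size):
    """Stand-in for the NST: DCT, then truncate or zero-pad to `size`."""
    return (dct(x) + [0.0] * size)[:size]

def split_transform(amplitudes):
    """Transform one frame's harmonic amplitudes as two separate groups.

    Assumes the frame has at least FIRST_GROUP harmonics.
    """
    first = amplitudes[:FIRST_GROUP]
    rest = amplitudes[FIRST_GROUP:]
    return dct(first), to_fixed_length(rest, SECOND_COEFFS)

for n_harmonics in (16, 40):
    amps = [1.0 / (h + 1) for h in range(n_harmonics)]  # made-up amplitudes
    first_c, second_c = split_transform(amps)
    print(n_harmonics, "harmonics ->", len(first_c), "+", len(second_c))
```

Whatever the harmonic count, the sketch produces two 8-element vectors, so each can be passed to a fixed-dimension vector quantizer, and only the second group passes through the variable-dimension step.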
According to the invention in a second aspect, there is provided a method of decoding an input data signal for speech synthesis comprising the steps of:
vector dequantizing a plurality of indices of the data signal to form first and second sets of transform coefficients;
transforming the first and second sets of coefficients to derive respective first and second groups of harmonic amplitudes;
deriving pitch and voiced/unvoiced decision information from the input data signal;
performing multi-band excitation analysis on the information and the harmonic amplitudes to form a synthesized signal; and constructing a speech signal from the synthesized signal.
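The inverse transform step of this second aspect can be sketched in the same spirit. The sketch assumes that the two coefficient sets have already been dequantized and that the number of harmonics in the frame (at least 8) is known to the decoder, for instance from the decoded pitch. The function names are hypothetical, and the inverse for the second group is a DCT-based stand-in rather than the inverse of the NST in the cited papers.

```python
import math

def idct(c):
    """Inverse of the orthonormal DCT-II."""
    n = len(c)
    out = []
    for i in range(n):
        total = 0.0
        for k, v in enumerate(c):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(total)
    return out

def from_fixed_length(c, count):
    """Stand-in inverse NST: truncate or zero-pad to `count`, then inverse DCT."""
    return idct((list(c) + [0.0] * count)[:count])

def rebuild_amplitudes(first_coeffs, second_coeffs, n_harmonics):
    """Recover the harmonic amplitudes of one frame from both coefficient sets.

    Assumes n_harmonics is at least the size of the first group.
    """
    first = idct(first_coeffs)
    rest = from_fixed_length(second_coeffs, n_harmonics - len(first))
    return first + rest

# Hypothetical dequantized coefficients for a frame with 12 harmonics.
first_c = [math.sqrt(8.0)] + [0.0] * 7
second_c = [2.0] + [0.0] * 7
print(rebuild_amplitudes(first_c, second_c, 12))
```

With these inputs each group carries only a constant term, so all 12 recovered amplitudes are equal (1.0 up to rounding). The recovered amplitudes would then be combined with the pitch and voiced/unvoiced information to form the synthesized signal.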
According to the invention in a third aspect, there is provided speech coding apparatus comprising:
means for sampling a speech signal and dividing the sampled signal into a plurality of frames;
a multi-band excitation analyzer for deriving a fundamental pitch and a plurality of voiced / unvoiced
