Reexamination Certificate
2002-11-12
2004-01-13
Dorvil, Richemond (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
For storage or transmission
C704S243000
active
06678655
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to low bit rate speech coding and to speech recognition for the purpose of speech to text conversion.
REFERENCES
In the following description reference is made to the following publications:
[1] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences”, IEEE Trans ASSP, Vol. 28, No. 4, pp. 357-366, 1980.
[2] S. Young, “A review of large-vocabulary continuous-speech recognition”, IEEE Signal Processing Magazine, pp. 45-47, September 1996.
[3] R. J. McAulay and T. F. Quatieri, “Speech analysis-synthesis based on a sinusoidal representation”, IEEE Trans ASSP, Vol. 34, No. 4, pp. 744-754, 1986.
[4] R. J. McAulay and T. F. Quatieri, “Sinusoidal coding”, in W. Kleijn and K. Paliwal (Editors), “Speech Coding and Synthesis”, ch. 4, pp. 121-170, Elsevier, 1995.
[5] Y. Medan, E. Yair and D. Chazan, “Super resolution pitch determination of speech signals”, IEEE Trans ASSP, Vol. 39, No. 1, pp. 40-48, 1991.
[6] W. Hess, “Pitch Determination of Speech Signals”, Springer-Verlag, 1983.
[7] G. Ramaswamy and P. Gopalakrishnan, “Compression of acoustic features for speech recognition in network environment”, Proceedings of ICASSP, 1998.
BACKGROUND OF THE INVENTION
In digital transmission of speech, a speech coding scheme is usually employed. At the receiver, the speech is decoded so that a human listener can hear it. The decoded speech may also serve as input to a speech recognition system. Low bit rate coding, used to transmit speech over a channel of limited bandwidth, may impair recognition accuracy compared with uncompressed speech. Moreover, the need to decode the speech adds computational overhead to the recognition process.
A similar problem occurs when the coded speech is stored for later playback and deferred recognition, e.g., in a hand-held device where storage is limited.
It is therefore desirable to encode speech at a low bit-rate so that:
1. Speech may be decoded from the encoded bit-stream (for a human listener); and
2. A recognition system may use the decoded bit-stream with no impairment of recognition accuracy and no added computational overhead.
SUMMARY OF THE INVENTION
It is therefore an object of the invention to provide a method for encoding speech at a low bit-rate to produce a bit stream which may be decoded as audible speech.
This object is realized in accordance with a first aspect of the invention by a method for encoding a digitized speech signal so as to generate data capable of being decoded as speech, said method comprising the steps of:
(a) converting the digitized speech signal to a series of feature vectors by:
i) deriving at successive instances of time an estimate of the spectral envelope of the digitized speech signal,
ii) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, wherein each window occupies a narrow range of frequencies, and computing the integrals thereof, and
iii) assigning said integrals or a set of predetermined functions thereof to respective components of a corresponding feature vector in said series of feature vectors;
(b) computing for each instance of time a respective pitch value of the digitized speech signal, and
(c) compressing successive acoustic vectors each containing the respective pitch value and feature vector so as to derive therefrom a bit stream.
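As a rough illustration of steps (a)(ii) and (a)(iii), the sketch below multiplies a sampled spectral envelope by a bank of narrow frequency-domain windows and sums each product as a discrete stand-in for the integral. The triangular window shape, the linear spacing, and the names `triangular_windows` and `feature_vector` are assumptions made for the example and are not taken from the text above. The envelope itself (step (a)(i)) is taken as given.

```python
import math

def triangular_windows(num_windows, num_bins):
    """Hypothetical window bank: overlapping triangles, linearly spaced over the bins."""
    # num_windows + 2 edge points; window w rises from edge w to edge w+1, falls to edge w+2.
    edges = [i * (num_bins - 1) / (num_windows + 1) for i in range(num_windows + 2)]
    windows = []
    for w in range(num_windows):
        lo, mid, hi = edges[w], edges[w + 1], edges[w + 2]
        win = []
        for k in range(num_bins):
            if lo <= k <= mid:
                win.append((k - lo) / (mid - lo))
            elif mid < k <= hi:
                win.append((hi - k) / (hi - mid))
            else:
                win.append(0.0)
        windows.append(win)
    return windows

def feature_vector(envelope, windows, use_log=True):
    """Multiply the envelope by each window, integrate (sum over bins), and
    assign the integral, or a function of it (here the log), to one component."""
    features = []
    for win in windows:
        integral = sum(e * w for e, w in zip(envelope, win))
        # The small floor keeps log() defined when a window sees no energy.
        features.append(math.log(integral + 1e-10) if use_log else integral)
    return features

# Example: a flat envelope sampled at 33 frequency bins, 4 windows.
envelope = [1.0] * 33
windows = triangular_windows(4, 33)
features = feature_vector(envelope, windows, use_log=False)
```

With a flat envelope each component is simply the area of its window, which is a quick sanity check that every window covers a narrow, non-empty range of frequencies.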
According to a second, complementary aspect of the invention there is provided a method for decoding a bit-stream representing a compressed series of acoustic vectors each containing a respective feature vector and a respective pitch value derived at a respective instance of time, each of the feature vectors having multiple components obtained by:
i) deriving at successive instances of time an estimate of the spectral envelope of a digitized speech signal,
ii) multiplying each estimate of the spectral envelope by a predetermined set of frequency domain window functions, wherein each window occupies a narrow range of frequencies, and computing the integral thereof, and
iii) assigning said integrals or a set of predetermined functions thereof to a respective one of said remaining components of the feature vector;
said method comprising the steps of:
(a) separating the received bit-stream into compressed feature vectors data and compressed pitch values data,
(b) decompressing the compressed feature vectors data and outputting quantized feature vectors,
(c) decompressing the compressed pitch values data and outputting quantized pitch values, and
(d) generating a continuous speech signal, using the quantized feature vectors and pitch values.
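Steps (a) to (c) of the decoding method can be sketched as follows. The frame layout (one pitch byte followed by a fixed number of feature bytes), the 8-bit uniform quantizer, and the value ranges are purely hypothetical choices for illustration; the text above does not specify the compression format. Step (d), speech generation, is not covered by this sketch.

```python
def dequantize(code, lo, hi, levels=256):
    """Map an integer code in [0, levels-1] back to a value in [lo, hi] (uniform quantizer)."""
    return lo + code * (hi - lo) / (levels - 1)

def split_stream(stream, num_features):
    """Step (a): separate the stream into pitch codes and feature codes.
    Assumed layout per frame: 1 pitch byte, then num_features feature bytes."""
    frame_size = 1 + num_features
    if len(stream) % frame_size != 0:
        raise ValueError("stream length is not a whole number of frames")
    pitch_codes, feature_codes = [], []
    for start in range(0, len(stream), frame_size):
        pitch_codes.append(stream[start])
        feature_codes.append(list(stream[start + 1:start + frame_size]))
    return pitch_codes, feature_codes

def decode(stream, num_features, pitch_range=(50.0, 400.0), feat_range=(-10.0, 10.0)):
    """Steps (a)-(c): return quantized feature vectors and quantized pitch values."""
    pitch_codes, feature_codes = split_stream(stream, num_features)
    # Step (b): decompress the feature data.
    features = [[dequantize(c, *feat_range) for c in frame] for frame in feature_codes]
    # Step (c): decompress the pitch data.
    pitches = [dequantize(c, *pitch_range) for c in pitch_codes]
    return features, pitches

# Example: two frames with two feature components each.
features, pitches = decode(bytes([0, 0, 255, 255, 255, 0]), num_features=2)
```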
The invention will best be appreciated with regard to speech recognition schemes as currently implemented. All speech recognition schemes start by converting the digitized speech to a set of features that are then used in all subsequent stages of the recognition process. A commonly used set of features is the Mel-frequency cepstral coefficients (MFCC) [1, 2], which can be regarded as a specific case of the above-described feature vectors. Transmitting a compressed version of the set of feature vectors removes the overhead required for decoding the speech. The feature extraction stage of the recognition process is replaced by feature decompression, which requires an order of magnitude fewer computations. Furthermore, low bit rate transmission of the Mel-cepstral features (4-4.5 kbps) is possible without impairing the recognition accuracy [7].
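To show in what sense MFCC is a specific case of the feature vectors described above, the sketch below takes the window integrals as input and applies one particular "predetermined function" to them: a logarithm followed by a discrete cosine transform (DCT-II), which is the usual way cepstral coefficients are obtained from filter-bank outputs. The function name, the unnormalized DCT, and the energy floor are assumptions of this example.

```python
import math

def cepstral_coefficients(bin_integrals, num_coeffs):
    """Log of each window integral, followed by an unnormalized DCT-II."""
    # Floor the integrals so the logarithm is always defined.
    log_e = [math.log(max(e, 1e-10)) for e in bin_integrals]
    n = len(log_e)
    return [
        sum(log_e[j] * math.cos(math.pi * i * (j + 0.5) / n) for j in range(n))
        for i in range(num_coeffs)
    ]

# Example: 8 equal window integrals give a single non-zero coefficient (index 0),
# because a flat log-spectrum has no variation for the higher cosine terms to pick up.
coeffs = cepstral_coefficients([math.e] * 8, num_coeffs=4)
```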
The invention is based on the finding that if compressed pitch information is transmitted together with the speech recognition features, it is possible to obtain a good quality reproduction of the original speech.
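One way to see why pitch complements the recognition features: the features describe the spectral envelope, while the pitch tells the decoder where the harmonics lie under that envelope. The sketch below sums cosines at multiples of the pitch, with amplitudes read from the envelope, loosely in the spirit of the sinusoidal model of [3, 4]. It is an illustrative assumption, not the reconstruction procedure of the invention; the zero phases, the 8 kHz sampling rate, and the callable `envelope_at` are all hypothetical.

```python
import math

def synthesize_frame(pitch_hz, envelope_at, sample_rate=8000, num_samples=160):
    """Sum harmonics of the pitch; each amplitude is the envelope sampled at that harmonic.
    All phases are set to zero, which is a simplification for this sketch."""
    num_harmonics = int((sample_rate / 2) // pitch_hz)  # harmonics up to Nyquist
    frame = []
    for n in range(num_samples):
        t = n / sample_rate
        frame.append(sum(
            envelope_at(k * pitch_hz) * math.cos(2 * math.pi * k * pitch_hz * t)
            for k in range(1, num_harmonics + 1)
        ))
    return frame

# Example: an envelope that passes only the fundamental of a 100 Hz pitch
# yields a pure 100 Hz cosine (period 80 samples at 8 kHz).
frame = synthesize_frame(100.0, lambda f: 1.0 if f < 150.0 else 0.0)
```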
The encoder consists of a feature extraction module, a pitch detection module and a features and pitch compression module. The decoder consists of a decompression module for the features and pitch and a speech reconstruction module.
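For the pitch detection module, a minimal sketch using a plain autocorrelation search is given below. This is a textbook technique chosen for brevity and is not the super-resolution method of [5]; the search range of 50 to 400 Hz and the function name `estimate_pitch` are assumptions, and no voiced/unvoiced decision is attempted beyond returning `None` for a silent frame.

```python
import math

def estimate_pitch(frame, sample_rate, min_hz=50.0, max_hz=400.0):
    """Return the pitch in Hz whose lag maximizes the autocorrelation, or None."""
    if not any(frame):
        return None  # silent frame: no pitch to report
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

# Example: a 200 Hz sine sampled at 8 kHz has a period of 40 samples.
sine = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
pitch = estimate_pitch(sine, 8000)
```

The resolution of such an estimator is limited to integer lags, which is one reason more refined methods such as those surveyed in [6] exist.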
It should be noted that in some recognition systems, especially for tonal languages, the pitch information is used for recognition and pitch detection is applied as a part of the recognition process. In that case, the encoder only compresses information that the recognition process obtains anyway.
It is possible to encode additional components that are not used for speech recognition, but may be used by the decoder to enhance the reconstructed speech quality.
REFERENCES:
patent: 4797926 (1989-01-01), Bronson et al.
patent: 4827516 (1989-05-01), Tsukahara et al.
patent: 4914701 (1990-04-01), Zibman
patent: 4969193 (1990-11-01), Scott et al.
patent: 5054085 (1991-10-01), Meisel et al.
patent: 5583961 (1996-12-01), Pawlewski et al.
patent: 5754974 (1998-05-01), Griffin et al.
patent: 5909662 (1999-06-01), Yamazaki et al.
patent: 5933801 (1999-08-01), Fink et al.
patent: 6092039 (2000-07-01), Zingher
patent: 6336090 (2002-01-01), Chou et al.
Chazan Dan
Hoory Ron
Silvera Ezra
Zibulski Meir
Browdy and Neimark
Dorvil Richemond
International Business Machines Corporation
Storm Donald L.