Reexamination Certificate
1998-10-02
2001-10-30
Dorvil, Richemond (Department: 2741)
Data processing: speech signal processing, linguistics, language
Speech signal processing
For storage or transmission
C704S219000
active
06311153
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to a method and an apparatus for compressing an audio signal obtained by transforming music into an electric signal, and to a method and an apparatus for compressing a speech signal obtained by transforming speech into an electric signal. These methods and apparatuses are capable of compressing the audio signal or the speech signal more efficiently than conventional methods and apparatuses while maintaining a high sound quality, in particular when the signal is compressed using a weighting function on frequency based on human auditory characteristics, so as to enable transmission of the audio signal or the speech signal over a transmission line of small capacity and efficient storage of the signal on recording media.
The present invention further relates to a method and an apparatus for recognizing speech, which are capable of providing a higher recognition rate than conventional methods and apparatuses, in particular, when performing recognition using parameters having different resolutions for different frequencies, which parameters are obtained by a linear prediction coding analysis utilizing human auditory characteristics.
BACKGROUND OF THE INVENTION
There have been proposed a variety of audio signal compression methods of this type and, hereinafter, one example of those methods will be described.
Initially, a time series of an input audio signal is transformed into a frequency characteristic signal sequence for each length of a specific period (frame) by MDCT (modified discrete cosine transform), FFT (fast Fourier transform) or the like. Further, the input audio signal is subjected to linear predictive analysis (LPC analysis), frame by frame, to extract LPC coefficients (linear predictive coefficients), LSP coefficients (line spectrum pair coefficients), PARCOR coefficients (partial auto-correlation coefficients) or the like, and an LPC spectrum envelope is obtained from these coefficients. Next, the frequency characteristic is flattened by dividing the calculated frequency characteristic signal sequence by the LPC spectrum envelope, and then the power is normalized using the maximum value or the mean value of the power.
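The flattening and power normalization steps described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular codec: it assumes the LPC coefficients are already available, that the polynomial is written as A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, that the envelope is taken as the amplitude 1/|A|, and that the power is normalized by its mean value. The function names `lpc_amplitude_envelope` and `flatten_and_normalize` are hypothetical.

```python
import cmath
import math

def lpc_amplitude_envelope(a, n_bins):
    """Amplitude envelope 1/|A(e^{jw})| at n_bins frequencies in [0, pi).

    `a` holds a_1..a_p of A(z) = 1 + sum_k a_k z^-k (assumed sign convention).
    """
    envelope = []
    for i in range(n_bins):
        w = math.pi * i / n_bins
        A = 1.0 + sum(ak * cmath.exp(-1j * w * k) for k, ak in enumerate(a, start=1))
        envelope.append(1.0 / abs(A))
    return envelope

def flatten_and_normalize(coeffs, envelope):
    """Divide frequency coefficients by the envelope, then normalize mean power to 1.

    Returns the flattened, power-normalized residual and the gain that was removed.
    Assumes the frame is not all zeros.
    """
    flat = [c / e for c, e in zip(coeffs, envelope)]
    gain = math.sqrt(sum(f * f for f in flat) / len(flat))
    residual = [f / gain for f in flat]
    return residual, gain
```

If the frequency coefficients follow the envelope exactly, the residual comes out flat (all ones) and the removed gain equals the overall scale factor, which is the sense in which the envelope "explains" the spectral shape.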
In the following description, output coefficients at the power normalization are called “residual signals”. Further, the flattened residual signals are vector-quantized using the spectrum envelope as a weight.
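A weighted vector quantization of this kind can be sketched as a nearest-neighbour search under a weighted squared error. The codebook contents, the weights, and the function name `weighted_vq` below are illustrative assumptions; in the scheme described above the weights would be derived from the spectrum envelope.

```python
def weighted_vq(vector, weights, codebook):
    """Return (index, distortion) of the code vector with the smallest
    weighted squared error sum_i w_i * (v_i - c_i)^2."""
    best_index, best_dist = -1, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum(w * (v - c) ** 2 for v, c, w in zip(vector, code, weights))
        if dist < best_dist:
            best_index, best_dist = idx, dist
    return best_index, best_dist

# Hypothetical two-entry codebook for a two-dimensional residual vector.
codebook = [[0.9, 0.8], [0.5, 0.0]]
target = [1.0, 0.0]
uniform_choice, _ = weighted_vq(target, [1.0, 1.0], codebook)
weighted_choice, _ = weighted_vq(target, [10.0, 0.1], codebook)
```

With uniform weights the second entry is closer; when the first component is weighted heavily, the first entry wins because its error falls mostly on the lightly weighted component.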
As an example of such an audio signal compression method, there is TwinVQ (Iwagami, Moriya, Miki: “Audio Coding by Frequency-Weighted Interleave Vector Quantization (TwinVQ)”, Anthology of Lectured Papers of Acoustic Society, 1-P-1, pp. 339-340, 1994).
Next, a speech signal compression method according to a prior art will be described.
First of all, a time series of an input speech signal is subjected to LPC analysis for each frame, whereby it is divided into LPC spectrum envelope components, such as LPC coefficients, LSP coefficients, or PARCOR coefficients, and residual signals, the frequency characteristic of which is flattened. The LPC spectrum envelope components are scalar-quantized, and the flattened residual signals are quantized according to a previously prepared sound source code book, whereby the components and the signals are each transformed into digital signals.
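The separation of a frame into envelope components and a residual can be sketched with the autocorrelation method and the Levinson-Durbin recursion, followed by inverse filtering. This is a generic textbook formulation rather than the procedure of any specific coder: windowing and quantization are omitted, the polynomial is assumed to be A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, and the sign convention of the reflection (PARCOR) coefficients is an assumption, since conventions differ between texts.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum_n x[n] * x[n - lag] for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, reflection, err) with a = [a_1..a_p] for
    A(z) = 1 + sum_k a_k z^-k and err the final prediction error power.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    reflection = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        reflection.append(k)
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a[1:], reflection, err

def lpc_residual(x, a):
    """Inverse filtering: e[n] = x[n] + sum_k a_k * x[n - k]."""
    out = []
    for n in range(len(x)):
        e = x[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                e += ak * x[n - k]
        out.append(e)
    return out

# Synthetic frame: impulse response of 1 / (1 - 0.9 z^-1).
frame = [0.9 ** n for n in range(200)]
coeffs, parcor, err = levinson_durbin(autocorrelation(frame, 2), 2)
excitation = lpc_residual(frame, coeffs)
```

For this synthetic frame the analysis recovers a_1 close to -0.9, and the residual collapses to the single impulse that excited the filter, which is the flattened signal a code book would then have to represent.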
As an example of such a speech signal compression method, there is CELP (M. R. Schroeder and B. S. Atal, “Code-excited Linear Prediction (CELP) High Quality Speech at Very Low Rates”, Proc. ICASSP-85, March 1985).
Further, a speech recognition method according to a prior art will be described.
Generally, in a speech recognition apparatus, speech recognition is performed as follows. A standard model for each phoneme or word is formed in advance from speech data, and a parameter corresponding to a spectrum envelope is obtained from an input speech. Then, the similarity between the time series of the input speech and the standard model is calculated, and the phoneme or word corresponding to the standard model having the highest similarity is found. In this case, a hidden Markov model (HMM) or the time series itself of a representative parameter is used as the standard model (Seiici Nakagawa, “Speech Recognition by Probability Model”, Edited by Electronics Information and Communication Society, pp. 18-80).
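For the case in which the standard model is the parameter time series itself, the matching step can be sketched with dynamic time warping (DTW). DTW is chosen here only as a common way to compare sequences of different lengths; the text above does not name the similarity measure. The lowest DTW distance stands in for the highest similarity, and the templates and labels are made up for illustration.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences of feature vectors,
    using Euclidean distance between frames."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sum((x - y) ** 2 for x, y in zip(seq_a[i - 1], seq_b[j - 1])) ** 0.5
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(features, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))

# Hypothetical one-dimensional "standard models" for two words.
templates = {
    "rising": [[0.0], [1.0], [2.0]],
    "falling": [[2.0], [1.0], [0.0]],
}
observed = [[0.0], [0.0], [1.0], [2.0]]  # same shape as "rising", spoken more slowly
result = recognize(observed, templates)
```

The observed sequence is longer than either template, yet it aligns with the "rising" template at zero cost because the warping path is allowed to repeat a template frame.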
Conventionally, recognition is performed using, as a time series of a parameter obtained from an input speech, one of the following cepstrum coefficients: LPC cepstrum coefficients, which are obtained by transforming a time series of an input speech into LPC coefficients for each length of a specific period (frame) by LPC analysis and then subjecting the resulting LPC coefficients to cepstrum transformation (“Digital Signal Processing of Speech and Audio Information”, by Kiyohiro Sikano, Satosi Nakamura, Siro Ise, Shyokodo, pp. 10-16), or cepstrum coefficients, which are obtained by transforming an input speech into power spectrums for each length of a specific period (frame) by DFT or a band pass filter bank and then subjecting the resulting power spectrums to cepstrum transformation.
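The transformation from LPC coefficients to LPC cepstrum coefficients can be sketched with the standard recursion for an all-pole model. The sketch assumes the model H(z) = 1/A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p and omits the gain term c_0; with the opposite sign convention for a_k the signs in the recursion flip. The function name `lpc_to_cepstrum` is hypothetical.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n of H(z) = 1/A(z), A(z) = 1 + sum_k a_k z^-k.

    Recursion: c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k},
    with a_n taken as 0 for n > p.
    """
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c

# First-order check: for A(z) = 1 - 0.5 z^-1 the cepstrum is c_n = 0.5**n / n.
cepstrum = lpc_to_cepstrum([-0.5], 4)
```

The first-order case has a closed form, because the logarithm of 1/(1 - 0.5 z^-1) expands as a power series with terms 0.5^n / n, which makes it a convenient sanity check for the recursion.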
In the prior art audio signal compression method, residual signals are obtained by dividing a frequency characteristic signal sequence calculated by MDCT or FFT by an LPC spectrum envelope, and normalizing the result.
On the other hand, in the prior art speech signal compression method, an input speech signal is separated into an LPC spectrum envelope calculated by LPC analysis and residual signals. The prior art audio signal compression method and the prior art speech signal compression method are similar in that spectrum envelope components are removed from the input signal by the standard LPC analysis, i.e., residual signals are obtained by normalizing (flattening) the input signal by the spectrum envelope. Therefore, if the performance of this LPC analysis is improved, or the precision with which the LPC analysis estimates the spectrum envelope is increased, it is possible to compress information more efficiently than the prior art methods while maintaining a high sound quality.
In the standard LPC analysis, an envelope is estimated with the same frequency resolution for every frequency band. Therefore, in order to increase the frequency resolution for a low frequency band, which is auditorily important, i.e., in order to obtain a precise spectrum envelope for the low frequency band, the analysis order must be increased, resulting in an increased amount of information.
Further, increasing the analysis order results in an unnecessary increase in resolution for a high frequency band, which is not auditorily very important. In this case, calculation of a spectrum envelope having a peak in a high frequency band might be required, thereby degrading the sound quality.
Furthermore, in the prior art audio signal compression method, when vector quantization is performed, weighting is carried out on the basis of a spectrum envelope alone. Therefore, efficient quantization utilizing human auditory characteristics is impossible in the standard LPC analysis.
In the prior art speech recognition method, if LPC cepstrum coefficients obtained by the standard LPC analysis are used for the recognition, sufficient recognition performance might not be achieved because the LPC analysis is not based on human auditory characteristics.
It is well known that human hearing fundamentally tends to regard low-band frequency components as important and high-band frequency components as less important than the low-band components.
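This tendency is commonly modelled with a mel frequency scale. The sketch below uses one widely quoted conversion, mel = 2595 log10(1 + f/700); this particular formula is an assumption made for illustration and is not taken from the text. It places band edges at equal steps on the mel scale, and the helper names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """One common Hz-to-mel conversion (assumed here, not taken from the source)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_max_hz, n_bands):
    """Band edges in Hz that are equally spaced on the mel scale from 0 to f_max_hz."""
    top = hz_to_mel(f_max_hz)
    return [mel_to_hz(top * i / n_bands) for i in range(n_bands + 1)]

edges = mel_band_edges(8000.0, 8)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

Bands that are equally wide in mel become progressively wider in Hz, so a representation that is uniform on the mel scale spends more of its resolution on the low band than on the high band.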
A recognition method based on this tendency has been proposed, in which recognition is performed using mel-LPC coefficients obtained by subjecting the LPC cepstrum coefficients to mel-transformation (“Digital Signal Processing of Speech and Audio Information”, by Kiyohiro Sikano, Satosi Nakamura, Siro Ise, Shyokodo, pp. 39-40). However, in the LPC analysis for producing LPC cepstrum coefficients, human auditory characteristics are not sufficiently considered and, th
Ishikawa Tomokazu
Katayama Taro
Nakahashi Jun-ichi
Nakatoh Yoshihisa
Norimatsu Takeshi
Dorvil Richemond
Matsushita Electric Industrial Co., Ltd.
Wenderoth, Lind & Ponack, L.L.P.