Reexamination Certificate
1999-03-12
2001-09-18
Tsang, Fan (Department: 2645)
Data processing: speech signal processing, linguistics, language
Speech signal processing
For storage or transmission
C704S220000, C704S231000, C704S236000, C704S239000
active
06292776
ABSTRACT:
TECHNICAL FIELD
The invention relates to the field of speech recognition and, more particularly, to a method and apparatus for improved hidden Markov model (HMM) based speech recognition.
BACKGROUND OF THE INVENTION
A typical continuous speech recognizer consists of a front-end feature analysis stage followed by a statistical pattern classifier. The feature vector, which is the interface between these two stages, should ideally contain all the information in the speech signal relevant to subsequent classification, be insensitive to irrelevant variations due to changes in the acoustic environment, and at the same time have a low dimensionality in order to minimize the computational demands of the classifier. Several types of feature vectors have been proposed as approximations of this ideal feature vector, as in the article by J. W. Picone, entitled “Signal Modeling Techniques in Speech Recognition”, Proceedings of the IEEE, Vol. 81, No. 9, 1993, pp. 1215-1247. Most speech recognizers have traditionally utilized cepstral parameters derived from a linear predictive (LP) analysis because LP analysis generates a smooth spectrum, free of pitch harmonics, and models spectral peaks reasonably well. Mel-based cepstral parameters, on the other hand, take advantage of the perceptual properties of the human auditory system by sampling the spectrum at mel-scale intervals. Combining the merits of both LP analysis and mel-filter bank analysis should, in theory, produce an improved set of cepstral features.
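As an illustration of sampling the spectrum at mel-scale intervals, the following minimal sketch computes frequencies that are equally spaced on the mel scale. The conversion formula (2595 log10(1 + f/700)) is one commonly used approximation and is an assumption here, as are the function names, the 0 to 4000 Hz range, and the number of points; the text above does not specify any of them.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels (common approximation; an assumption)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spaced_frequencies(f_low, f_high, count):
    """Return `count` (>= 2) frequencies in Hz equally spaced on the mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (count - 1)
    return [mel_to_hz(m_low + i * step) for i in range(count)]

# Hypothetical example: 8 sampling points over an assumed 0-4000 Hz band.
centers = mel_spaced_frequencies(0.0, 4000.0, 8)
```

The resulting points are dense at low frequencies and progressively sparser at high frequencies, which is the property a mel filter bank exploits.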
This combination can be performed in several ways. For example, one could compute the log magnitude spectrum of the LP parameters and then warp the frequencies to correspond to the mel-scale. Previous studies have reported encouraging speech recognition results when warping the LP spectrum by a bilinear transformation prior to computing the cepstrum, compared with no warping, as in M. Rahim and B. H. Juang, “Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition”, IEEE Transactions on Speech and Audio Processing, Vol. 4, No. 1, 1996, pp. 19-30. Several other frequency warping techniques have been proposed; for example, H. W. Strube, “Linear Prediction on a Warped Frequency Scale”, Journal of Acoustical Society of America, Vol. 68, No. 4, 1980, pp. 1071-1076, proposes a mel-like spectral warping method through all-pass filtering in the time domain. Another approach is to apply mel-filter bank analysis to the signal followed by LP analysis to give what will be referred to as mel linear predictive cepstral (mel-lpc) features (see M. Rahim and B. H. Juang, “Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition”, IEEE Transactions on Speech and Audio Processing, Vol. 4, No. 1, 1996, pp. 19-30). The computation of the mel-lpc features is similar in some sense to the perceptual linear prediction (PLP) coefficients explained by H. Hermansky in “Perceptual Linear Predictive (PLP) analysis of Speech”, Journal of Acoustical Society of America, Vol. 87, No. 4, 1990, pp. 1738-1752. Both techniques apply a mel filter bank prior to LP analysis. However, the mel-lpc uses a higher order LP analysis with no perceptual weighting or amplitude compression. All the above techniques are attempts to perceptually model the spectrum of the speech signal for improved speech quality, and to provide a more efficient representation of the spectrum for speech analysis, synthesis and recognition in a whole band approach.
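The bilinear warping mentioned above can be sketched with the phase response of a first-order all-pass section, which maps a normalized angular frequency onto a warped axis. This is a generic illustration, not the procedure of any cited article: the function name and the warping coefficient alpha = 0.4 are assumptions chosen only for demonstration.

```python
import math

def bilinear_warp(omega, alpha):
    """Warp a normalized angular frequency omega in [0, pi].

    Uses the first-order all-pass mapping
        omega_w = omega + 2 * atan(alpha * sin(omega) / (1 - alpha * cos(omega)))
    with |alpha| < 1. alpha = 0 leaves the axis unchanged; alpha > 0
    stretches low frequencies, giving a mel-like axis.
    """
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

# Hypothetical example: warp 9 uniformly spaced frequencies with an assumed alpha.
uniform = [math.pi * i / 8 for i in range(9)]
warped = [bilinear_warp(w, 0.4) for w in uniform]
```

The mapping fixes the endpoints 0 and pi and is monotonic, so it only redistributes resolution along the frequency axis.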
In recent years there has been some work on subband-based feature extraction techniques, such as H. Bourlard and S. Dupont, “Subband-Based Speech Recognition”, Proc. ICASSP, 1997, pp. 1251-1254; P. McCourt, S. Vaseghi and N. Harte, “Multi-Resolution Cepstral Features for Phoneme Recognition Across Speech Subbands”, Proc. ICASSP, 1998, pp. 557-560; S. Okawa, E. Bocchieri and A. Potamianos, “Multi-Band Speech Recognition in Noisy Environments”, Proc. ICASSP, 1998, pp. 641-644; and S. Tibrewala and H. Hermansky, “Subband Based Recognition of Noisy Speech”, Proc. ICASSP, 1997, pp. 1255-1258. The article by P. McCourt, S. Vaseghi and N. Harte indicates that the use of multiple resolution levels yields no further advantage. Additionally, recent theoretical and empirical results have shown that auto-regressive spectral estimation from subbands is more robust and more efficient than full-band auto-regressive spectral estimation (S. Rao and W. A. Pearlman, “Analysis of Linear Prediction, Coding and Spectral Estimation from Subbands”, IEEE Transactions on Information Theory, Vol. 42, 1996, pp. 1160-1178).
As the articles cited above tend to indicate, there is still a need for advances and improvements in the art of speech recognizers.
It is an object of the present invention to provide a speech recognizer that has the advantages of both a linear predictive analysis and a subband analysis.
SUMMARY OF THE INVENTION
Briefly stated, an advance in the speech recognizer art is achieved by providing an approach for prediction analysis in which the predictor is computed from a number of mel-warped subband-based autocorrelation functions obtained from the frequency spectrum of the input speech. Moreover, the level of sub-band decomposition and subsequent cepstral analysis can be increased such that features may be selected from a pyramid of resolution levels. An extended feature vector is formed by concatenating LP cepstral features from each multi-resolution sub-band, defining a high-dimensional space on which the statistical parameters are estimated.
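The flow summarized above can be sketched as follows, under stated assumptions. The sketch takes mel filter-bank power values for one frame, splits them into subbands at each resolution level, derives an autocorrelation sequence for each subband with an inverse cosine transform, solves for the predictor with the Levinson-Durbin recursion, converts it to LP cepstra, and concatenates the results. The function names, the cosine-transform form, the levels (1, 2), the LP order, and the number of cepstra are all hypothetical choices for illustration and are not taken from the text.

```python
import math

def split_bands(values, n_bands):
    """Split a list into n_bands contiguous, nearly equal subbands."""
    n = len(values)
    return [values[i * n // n_bands:(i + 1) * n // n_bands] for i in range(n_bands)]

def subband_autocorrelation(power, order):
    """Autocorrelation lags 0..order from subband power samples
    (inverse cosine transform; the exact transform is an assumption)."""
    K = len(power)
    return [sum(p * math.cos(math.pi * n * (k + 0.5) / K)
                for k, p in enumerate(power)) / K
            for n in range(order + 1)]

def levinson(r, order):
    """Levinson-Durbin recursion. Returns (a, err) with a[0] == 1 for
    the inverse filter A(z) = 1 + sum_i a[i] z^-i."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # updated prediction error
    return a, err

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1/A(z)."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]

def hierarchical_features(mel_power, levels=(1, 2), order=4, n_ceps=4):
    """Concatenate LP cepstra from every subband at every resolution level."""
    features = []
    for n_bands in levels:
        for band in split_bands(mel_power, n_bands):
            r = subband_autocorrelation(band, order)
            a, _ = levinson(r, order)
            features.extend(lpc_to_cepstrum(a, n_ceps))
    return features

# Hypothetical frame of 12 mel filter-bank power values.
example_power = [5.0, 4.0, 3.5, 3.0, 2.0, 1.5, 1.2, 1.0, 0.8, 0.6, 0.5, 0.4]
features = hierarchical_features(example_power)  # (1 + 2) subbands x 4 cepstra = 12 values
```

With levels (1, 2), the first level contributes full-band cepstra and the second contributes cepstra from the lower and upper half bands, so the extended vector carries both coarse and fine spectral detail. The LP order should stay well below twice the number of power samples in the smallest subband so that the recursion remains well conditioned.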
In a preferred embodiment, an advance in the art is provided by a method and apparatus for a recognizer based on a hidden Markov model (HMM) which uses continuous density mixtures to characterize the states of the HMM. An additional relative advantage is obtained by using a multi-resolution feature set in which the inclusion of different resolutions of sub-band decomposition in effect relaxes the restriction of using a single fixed speech band decomposition and leads to fewer string errors.
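To illustrate what characterizing an HMM state with a continuous density mixture involves, the following minimal sketch evaluates the log-likelihood of one feature vector under a mixture of Gaussians. Diagonal covariances, the function names, and the toy parameters are assumptions made for illustration; the text does not state the covariance structure or the number of mixture components.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance (assumed)."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def state_log_likelihood(x, weights, means, variances):
    """Log of sum_m w_m * N(x; mean_m, var_m), via the log-sum-exp trick."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical two-component mixture over a 2-dimensional feature vector.
score = state_log_likelihood(
    [0.2, -0.1],
    weights=[0.6, 0.4],
    means=[[0.0, 0.0], [1.0, -1.0]],
    variances=[[1.0, 1.0], [0.5, 0.5]],
)
```

During decoding, a score of this kind would be computed for each state and each frame, with the extended multi-resolution feature vector as the input x.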
In accordance with another embodiment of the invention, an advance in the art is achieved by providing an improved speech recognizer which uses multi-resolution mel-lpc features.
REFERENCES:
patent: 5271088 (1993-12-01), Bahler
patent: 5590242 (1996-12-01), Juang et al.
patent: 5765124 (1998-06-01), Rose et al.
patent: 5806022 (1998-09-01), Rahim et al.
patent: 5864806 (1999-01-01), Mokbel et al.
patent: 5867816 (1999-02-01), Nussbaum
patent: 5930753 (1999-06-01), Potamianos et al.
patent: 6064958 (2000-05-01), Takahashi et al.
patent: 6112175 (2000-08-01), Chengalvarayan
patent: 6157909 (2000-12-01), Mauuary et al.
patent: 674306A2 (1995-03-01), None
Tokuda et al., “A Very Low Bit Rate Speech Coder Using HMM-based Speech Recognition/Synthesis Techniques”, IEEE Acoustics, Speech and Signal Processing, vol. 2, pp. 609-612, Jun. 1998.*
Strope et al., “Robust Word Recognition Using Threaded Spectral Pairs”, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 625-628, Jun. 1998.*
Nadeu et al., “Frequency and Time Filtering of Filter-Bank Energies for HMM Speech Recognition”, Spoken Language, 1996, ICSLP, vol. 1, pp. 430-433.*
McCourt et al., “Multi-Resolution Cepstral Features For Phoneme Recognition Across Speech Sub-Bands”, IEEE International Conference On Acoustics, Speech And Processing, XP000854639, May 15, 1998.
H. Hermansky, “Perceptual Linear Predictive (PLP) Analysis Of Speech”, Journal Of The Acoustical Society Of America, US, American Institute of Physics, XP000110674, Apr. 1, 1990.
H. Bourlard & S. Dupont, “Subband-Based Speech Recognition”, Proc. ICASSP, 1997, pp. 1251-1254.
W. Chou, M. G. Rahim & E. Buhrke, “Signal Conditioned Minimum Error Rate Training”, Proc. Eurospeech, 1995, pp. 495-498.
T. Eisele, R. Haeb-Umbach & D. Langmann, “A Comparative Study Of Linear Feature Trans
Lucent Technologies - Inc.
Opsasnick Michael N.
Penrod J. R.
Tsang Fan
Hierarchial subband linear predictive cepstral features for...