Automatic speech recognition with psychoacoustically-based...

Data processing: speech signal processing – linguistics – language – Recognition


Details

Classification: C704S251000
Type: Reexamination Certificate
Status: active
Patent number: 06701291

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to the field of automatic speech recognition and more particularly to a speech signal feature extraction method and apparatus for use therein which is easily tunable and thereby provides improved performance, especially in a variety of adverse (i.e., noisy) environments.
BACKGROUND OF THE INVENTION
In Automatic Speech Recognition (ASR) systems, certain characteristics or “features” of the input speech are compared to a corresponding set of features which have been stored in “models” based on an analysis of previously supplied “training” speech. Based on the results of such a comparison, the input speech is identified as a sequence of words from a possible vocabulary—namely, the words of the training speech from which the most closely matching models were derived. The process known as “feature extraction” is the crucial first step in the ASR process.
Specifically, feature extraction comprises extracting a predefined set of parameter values—most typically, cepstral (i.e., frequency-related) coefficients—from the input speech to be recognized, and then using these parameter values for matching against corresponding sets of parameter values which have been extracted from a variety of training speech utterances and stored in the set of speech models. Based on the results of such a matching process, the input speech can be “recognized”—that is, identified as being the particular utterance from which one of the speech models was derived.
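For illustration only, the matching step can be sketched as a nearest-template comparison of cepstral feature vectors. The names (`recognize`, `models`) are hypothetical, and practical ASR systems use statistical models (e.g., HMMs) rather than this simple distance test:

```python
import numpy as np

def recognize(features, models):
    """Pick the model whose stored feature template is closest to the
    input features (illustrative nearest-template match; real ASR
    systems use HMMs or neural acoustic models instead).
    features: (n_frames, n_ceps); models: dict of word -> template."""
    best_word, best_dist = None, float("inf")
    for word, template in models.items():
        # Compare frame-averaged cepstral vectors for simplicity.
        dist = np.linalg.norm(features.mean(axis=0) - template.mean(axis=0))
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word
```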
Currently, there are two common approaches to feature extraction used in automatic speech recognition systems—modeling the human voice production mechanism (i.e., the vocal tract) and modeling the human auditory perception system (i.e., the human cochlea and its processing). For the first approach, one of the most commonly employed features comprises a set of cepstral coefficients derived from linear predictive coding techniques (LPCC). This approach uses all-pole linear filters which simulate the human vocal tract. A narrow-band (e.g., 4 kHz) LPCC feature works fairly well in the recognition of speech produced in a “clean,” noise-free environment, but experiments have shown that such an approach results in large distortions in noisy environments, thereby causing a severe degradation of ASR system performance.
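As a sketch of how LPCC features can be obtained from LPC coefficients, the following applies the standard LPC-to-cepstrum recursion. The sign convention assumes a predictor polynomial A(z) = 1 − Σ a_k z^(−k); a given implementation may use the opposite sign:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients a[0..p-1] (i.e., a_1..a_p) to cepstral
    coefficients via the standard recursion
        c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k},
    where terms with a-index outside 1..p are dropped."""
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```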
It is generally accepted that improved performance in an ASR system which needs to be robust in noisy environments can be better achieved with the second approach, wherein the human auditory perception system is modeled. For this class of techniques, the most common feature comprises the set of cepstral coefficients derived from the outputs of a bank of filters placed on the mel frequency scale (MFCC), familiar to those of ordinary skill in the art. The filters are typically triangular in shape and are applied in the frequency domain. Note that the mel frequency scale is similar to the frequency response of the human cochlea. Like the LPCC feature, the MFCC feature works very well in “clean” environments, and although its performance in “adverse” (i.e., noisy) environments may be superior to that of LPCC, ASR systems implemented using the MFCC feature have still not provided adequate performance under many adverse conditions.
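A minimal sketch of the mel filterbank construction underlying MFCC, assuming the common 2595·log10(1 + f/700) form of the mel scale, follows. Note how triangles that are uniform on the mel axis become differently sized in linear frequency, which is exactly the per-filter complexity discussed below:

```python
import numpy as np

def mel(f):
    # A common mel-scale formula (one of several in the literature).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced uniformly on the mel scale, which makes
    them non-uniformly spaced and sized on the linear Hz axis."""
    mel_pts = np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)  # inverse mel
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for b in range(l, c):                     # rising edge
            fb[i - 1, b] = (b - l) / max(c - l, 1)
        for b in range(c, r):                     # falling edge
            fb[i - 1, b] = (r - b) / max(r - c, 1)
    return fb
```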
Perceptual linear predictive (PLP) analysis is another auditory-based approach to feature extraction. It uses several perceptually motivated transforms, including the Bark frequency scale, equal-loudness pre-emphasis, masking curves, etc. In addition, the relative spectra processing technique (RASTA) has been developed to filter the time trajectory of each spectral component in order to suppress constant factors in the spectrum. It has often been used together with the PLP feature, the combination being referred to as the RASTA-PLP feature. As with techniques which use the MFCC feature, the use of these techniques in implemented ASR systems has often provided unsatisfactory results in many noisy environments.
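As an illustration of RASTA's temporal filtering, the sketch below band-pass filters each spectral trajectory across frames. The particular coefficients (the FIR taps shown and a pole near 0.98) follow a widely cited form of the RASTA filter and should be treated as an assumption here:

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectra):
    """Band-pass filter each spectral component's time trajectory to
    suppress constant (and very slowly varying) spectral factors.
    log_spectra: array of shape (n_frames, n_bands).
    Coefficients are a commonly cited RASTA form, assumed here."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR numerator taps
    a = np.array([1.0, -0.98])                       # IIR pole near 0.98
    return lfilter(b, a, log_spectra, axis=0)
```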
Each of the above features is typically based on a Fast Fourier Transform (FFT) to convert speech waveforms from a time domain representation to a frequency domain representation. Note, however, that the FFT and other typical frequency transforms produce their results on a linear frequency scale. Thus, each of the above perception-based approaches must perform the filtering process essentially as the human cochlea does—with a complex set of filters differentially spaced in frequency, for example, in accordance with a mel or Bark scale. Moreover, the filters must be individually shaped depending on each particular filter's location along the scale.
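For concreteness, a linear-to-Bark conversion can be sketched as follows, using Zwicker's commonly cited approximation of the Bark scale. The interpolation-based resampling is an illustrative choice, not a prescribed method:

```python
import numpy as np

def hz_to_bark(f):
    """Zwicker's approximation of the Bark scale (one common form):
    z = 13*arctan(0.00076 f) + 3.5*arctan((f / 7500)^2)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def to_bark_axis(spectrum, sr, n_bark_bins):
    """Resample a linear-frequency spectrum onto a uniform Bark axis
    (simple interpolation, assumed here for illustration)."""
    freqs = np.linspace(0.0, sr / 2.0, len(spectrum))
    barks = hz_to_bark(freqs)
    bark_axis = np.linspace(barks[0], barks[-1], n_bark_bins)
    return np.interp(bark_axis, barks, spectrum)
```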
Because of the high degree of complexity in developing filter sets for each of these approaches, it has proven very difficult to implement ASR systems which perform well in various noisy environments. In particular, such ASR systems cannot be easily modified (i.e., “tuned”) to optimize their performance in different acoustic environments. As such, it would be advantageous to derive an auditory-based speech feature which includes a filter set of reduced overall complexity, thereby allowing for the design and implementation of a relatively easily tunable ASR system whose operation can be optimized in a variety of (e.g., noisy) acoustic environments.
SUMMARY OF THE INVENTION
In accordance with the principles of the present invention, an auditory-based speech feature is provided which advantageously includes a filtering scheme that can be easily tuned for use in ASR in a variety of acoustic environments. In particular, the present invention provides a method and apparatus for extracting speech features from a speech signal in which the linear frequency spectrum of the speech signal, as generated, for example, by a conventional frequency transform, is first converted to a logarithmic frequency spectrum having frequency data distributed on a substantially logarithmic (rather than linear) frequency scale. Then, a plurality of filters is applied to the resultant logarithmic frequency spectrum, each of these filters having a substantially similar mathematical shape, but centered at different points on the logarithmic frequency scale. Because each of the filters has a similar shape, an ASR system incorporating the feature extraction approach of the present invention can advantageously be modified or tuned easily, by adjusting all of the filters in a coordinated manner that requires the adjustment of only a handful of filter parameters.
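A minimal sketch of such a shared-shape filter bank follows. The Gaussian shape and the `width` parameter are hypothetical stand-ins for whatever shape a designer selects; the point is that one shape and a handful of shared parameters control the entire bank:

```python
import numpy as np

def uniform_filterbank(n_filters, n_bins, width,
                       shape=lambda x: np.exp(-0.5 * x ** 2)):
    """Filters sharing one mathematical shape, uniformly spaced on an
    axis that is already logarithmic (e.g., Bark). Tuning the whole
    bank means adjusting the shared parameters (`width`, `shape`),
    not redesigning each filter individually."""
    centers = np.linspace(0, n_bins - 1, n_filters)
    bins = np.arange(n_bins)
    return np.stack([shape((bins - c) / width) for c in centers])
```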
In accordance with one illustrative embodiment of the present invention, the frequency transform is the FFT, the substantially logarithmic frequency scale is a Bark scale, and the plurality of filters are distributed (i.e., centered) at equal distances along the Bark scale. Also in accordance with this illustrative embodiment of the present invention, an outer and middle ear transfer function is applied to the frequency data prior to the conversion of the frequency spectrum from a linear frequency scale to the substantially logarithmic frequency scale, wherein the outer and middle ear transfer function advantageously approximates the signal processing performed by the combination of the human outer ear and the human middle ear. In addition, and also in accordance with this illustrative embodiment of the present invention, a logarithmic nonlinearity is advantageously applied to the outputs of the filters, and is followed by a discrete cosine transform (DCT) which advantageously produces DCT coefficients for use as speech features in an ASR system.
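Putting the pieces together, one possible reading of this illustrative embodiment is sketched below. It reuses the `to_bark_axis` and `uniform_filterbank` helpers from the earlier sketches, and the per-bin `ear_gain` weights standing in for the outer and middle ear transfer function are assumed, not taken from the patent:

```python
import numpy as np
from scipy.fft import dct

def extract_features(frame, sr, fb, ear_gain, n_ceps=13):
    """Sketch of the illustrative pipeline: FFT -> outer/middle-ear
    weighting -> uniform Bark axis -> shared-shape filters -> log -> DCT.
    frame: windowed time-domain samples; fb: (n_filters, n_bins) bank
    from uniform_filterbank; ear_gain: assumed per-FFT-bin weights."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2           # linear-frequency power spectrum
    spectrum = spectrum * ear_gain                       # outer/middle ear transfer function
    bark_spec = to_bark_axis(spectrum, sr, fb.shape[1])  # linear -> Bark axis (sketch above)
    energies = fb @ bark_spec                            # identical filters, equally spaced
    return dct(np.log(energies + 1e-10), norm="ortho")[:n_ceps]  # log nonlinearity + DCT
```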


REFERENCES:
patent: 6012334 (2000-01-01), Ando et al.
patent: 6076058 (2000-06-01), Chengalvarayan
patent: 6370504 (2002-04-01), Zick et al.
patent: 6438243 (2002-08-01), Ikeuchi et al.
patent: 2003/0018471 (2003-01-01), Cheng et al.
Davis, S. B., et al., “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences”, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-28, no. 4, pp. 357-366 (1980).
Makhoul, J., “Linear Prediction: A
