Wavelet-based energy binning cepstral features for automatic...

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate

Details

C704S237000, C704S245000

active

06253175

ABSTRACT:

BACKGROUND
1. Technical Field
The present application relates generally to speech recognition and, more particularly, to an acoustic signal processing system and method for providing wavelet-based energy binning cepstral features for automatic speech recognition.
2. Description of the Related Art
In general, there are many well-known signal processing techniques which are utilized in speech-based applications, such as speech recognition, for extracting spectral features from acoustic speech signals. The extracted spectral features are used to generate reference patterns (acoustic models) for certain identifiable sounds (phonemes) of the input acoustic speech signals.
Referring now to FIG. 1, a generalized speech recognition system in accordance with the prior art is shown. The speech recognition system 100 generally includes an acoustic front end 102 for preprocessing of speech signals, i.e., input utterances for recognition and training speech. Typically, the acoustic front end 102 includes a microphone to convert the acoustic speech signals into an analog electrical signal having a voltage which varies over time in correspondence to the variations in air pressure caused by the input speech utterances. The acoustic front end also includes an analog-to-digital (A/D) converter for digitizing the analog signal by sampling the voltage of the analog waveform at a desired “sampling rate” and converting the sampled voltage to a corresponding digital value. The sampling rate is typically selected to be at least twice the highest frequency component of the signal (e.g., a sampling rate of 16 kHz for pure speech or 8 kHz for a communication channel having a 4 kHz bandwidth).
Digital signal processing is performed on the digitized speech utterances (via the acoustic front end 102) by extracting spectral features to produce a plurality of feature vectors which, typically, represent the envelope of the speech spectrum. Each feature vector is computed for a given frame (or time interval) of the digitized speech, with each frame representing, typically, 10 ms to 30 ms. In addition, each feature vector includes “n” dimensions (parameters) to represent the sound within the corresponding time frame.
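The framing step described above can be sketched as follows. This is a minimal illustration, not the front end of FIG. 1: the function names `frame_signal` and `log_energy`, the 25 ms frame length, and the 10 ms frame shift are assumptions chosen for the example, and the single log-energy value stands in for a full n-dimensional feature vector.

```python
import math


def frame_signal(samples, sample_rate_hz, frame_ms=25.0, hop_ms=10.0):
    """Split a list of samples into overlapping frames of equal length.

    frame_ms and hop_ms are illustrative defaults (assumptions), chosen
    inside the 10 ms to 30 ms range mentioned in the text.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop_len = int(round(sample_rate_hz * hop_ms / 1000.0))
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def log_energy(frame):
    """A one-dimensional toy feature: log of the mean squared amplitude."""
    power = sum(x * x for x in frame) / len(frame)
    return math.log(power + 1e-12)  # small offset avoids log(0) on silence


# One second of a 440 Hz tone sampled at 16 kHz, one feature per frame.
rate = 16000
signal = [math.sin(2.0 * math.pi * 440.0 * n / rate) for n in range(rate)]
features = [log_energy(f) for f in frame_signal(signal, rate)]
```

In a real front end each frame would yield a vector of several parameters rather than a single number, but the frame-by-frame structure is the same.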
The system includes a training module 104 which uses the feature vectors generated by the acoustic front end 102 from the training speech to train a plurality of acoustic models (prototypes) which correspond to the speech baseforms (e.g., phonemes). A decoder 106 uses the trained acoustic models to decode (i.e., recognize) speech utterances by comparing and matching the acoustic models with the feature vectors generated from the input utterances using techniques such as the Hidden Markov Models (HMM) and Dynamic Time Warping (DTW) methods disclosed in “Statistical Methods For Speech Recognition”, by Fred Jelinek, MIT Press, 1997, which are well-known by those skilled in the art of speech recognition.
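As an illustration of the matching step, the following is a sketch of a textbook Dynamic Time Warping distance between two sequences of feature vectors. It is not taken from the cited reference and is not the decoder 106; the function names, the Euclidean local distance, and the unconstrained step pattern are assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Local distance between two feature vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best monotonic alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning the first i vectors of seq_a
    # with the first j vectors of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_a advances
                                 cost[i][j - 1],      # seq_b advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# A stored template and a slower rendition of the same pattern.
template = [[0.0], [1.0], [2.0]]
slow_utterance = [[0.0], [0.0], [1.0], [1.0], [2.0]]
other_utterance = [[5.0], [5.0], [5.0]]
```

Because the warping absorbs differences in speaking rate, `slow_utterance` aligns with `template` at zero cost, while `other_utterance` does not.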
Conventional feature extraction methods for automatic speech recognition generally rely on power spectrum approaches, whereby the acoustic signals are generally regarded as a one dimensional signal with the assumption that the frequency content of the signal captures the relevant feature information. This is the case for the spectrum representation, with its Mel or Bark variations, the cepstrum, whether FFT-derived (Fast Fourier Transform) or LPC-derived (Linear Predictive Coding), other LPC-derived features, the autocorrelation, the energy content, and all the associated delta and delta-delta coefficients.
Cepstral parameters are, at present, widely used for efficient speech and speaker recognition. Basic details and justifications can be found in various references: J. R. Deller, J. G. Proakis, and J. H. L. Hansen, “Discrete Time Processing of Speech Signals”, Macmillan, New York, N.Y., 1993; S. Furui, “Digital Speech Processing, Synthesis and Recognition”, Marcel Dekker, New York, N.Y., 1989; L. Rabiner and B-H. Juang, “Fundamentals of Speech Recognition”, Prentice-Hall, Englewood Cliffs, N.J., 1993; and A. V. Oppenheim and R. W. Schafer, “Digital Signal Processing”, Prentice-Hall, Englewood Cliffs, N.J., 1975. Originally introduced to separate the pitch contribution from the rest of the vocal cord and vocal tract spectrum, the cepstrum has the additional advantage of approximating the Karhunen-Loève transform of the speech signal. This property is highly desirable for recognition and classification.
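By way of example, a real cepstrum can be sketched as the inverse Fourier transform of the log magnitude spectrum. The code below is a minimal illustration under stated assumptions: the names `dft` and `real_cepstrum` are hypothetical, a direct O(N²) transform replaces the FFT for self-containment, and the test signal is an impulse plus a single echo rather than speech. The echo produces a periodic ripple in the log spectrum, which is the same mechanism by which pitch periodicity becomes a separable peak in the cepstrum.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (O(N^2), fine for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log magnitude spectrum of a real-valued frame."""
    log_mag = [math.log(max(abs(v), floor)) for v in dft(frame)]
    n = len(log_mag)
    # The log magnitude of a real signal is even in frequency, so the
    # inverse transform is real; only the real part is kept.
    return [sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * q / n)
                for k in range(n)).real / n
            for q in range(n)]


# Impulse plus an echo of gain 0.5 delayed by 8 samples (illustrative values).
N, delay, gain = 64, 8, 0.5
echo = [0.0] * N
echo[0] = 1.0
echo[delay] = gain
cep = real_cepstrum(echo)
# Index of the largest cepstral value in the first half, excluding index 0.
peak = max(range(1, N // 2), key=lambda q: cep[q])
```

For this signal the largest cepstral value in the first half falls at the echo delay, so `peak` equals `delay`.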
Speech production models, coding methods as well as text to speech technology often lead to the introduction of modulation models to represent speech signals with primary components which are amplitude-and-phase-modulated sine functions. For example, the conventional modulation model (MM) represents speech signals as a linear combination of amplitude and phase modulated components:
$$f(t) = \sum_{k=1}^{K} A_k(t) \cos[\theta_k(t)] + \eta(t)$$
where $A_k(t)$ is the instantaneous amplitude, $\omega_k(t) = \frac{d}{dt}\theta_k(t)$ is the instantaneous frequency of component (or formant) $k$, and where $\eta(t)$ takes into account the errors of modelling. In a more sophisticated model, the components are viewed as “ribbons” in the time-frequency plane rather than curves, and instantaneous bandwidths $\Delta\omega_k(t)$ are associated with each component. These parameters can be extracted and processed to generate feature vectors for speech recognition.
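The modulation model can be illustrated with a short sketch that evaluates a sum of amplitude-and-phase-modulated components and estimates the instantaneous frequency as the time derivative of the phase. The two components, their parameter values, and all function names are assumptions made for the example; the error term η(t) is omitted.

```python
import math


def modulation_model(components, t):
    """Evaluate f(t) as the sum over k of A_k(t) * cos(theta_k(t))."""
    return sum(amp(t) * math.cos(theta(t)) for amp, theta in components)


def instantaneous_frequency(theta, t, dt=1e-6):
    """Central-difference estimate of d(theta)/dt, in radians per second."""
    return (theta(t + dt) - theta(t - dt)) / (2.0 * dt)


# Component 1: slowly varying amplitude, 500 Hz carrier with phase modulation.
def amp_1(t):
    return 1.0 + 0.3 * math.cos(2.0 * math.pi * 5.0 * t)


def theta_1(t):
    return 2.0 * math.pi * 500.0 * t + 0.5 * math.sin(2.0 * math.pi * 4.0 * t)


# Component 2: constant amplitude, pure 1500 Hz carrier.
def amp_2(t):
    return 0.5


def theta_2(t):
    return 2.0 * math.pi * 1500.0 * t


components = [(amp_1, theta_1), (amp_2, theta_2)]
value_at_zero = modulation_model(components, 0.0)
freq_1_hz = instantaneous_frequency(theta_1, 0.0) / (2.0 * math.pi)
```

For the first component the phase modulation shifts the instantaneous frequency away from the 500 Hz carrier; differentiating `theta_1` analytically gives 500 + 2 cos(2π·4t) Hz, which the numerical estimate follows.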
Other methods which characterize speech with phase-derived features are, for example, the EIH (Ensemble Interval Histogram) (see O. Ghitza, “Auditory Models and Human Performances in Tasks Related to Speech Coding and Speech Recognition”, IEEE Trans. Speech Audio Proc., 2(1):pp. 115-132, 1994), SBS (In-Synchrony Bands Spectrum) (see O. Ghitza, “Auditory Nerve Representation Criteria For Speech Analysis/Synthesis”, IEEE Trans. ASSP, 6(35):pp 736-740, June 1987), and the IFD (Instantaneous-Frequency Distribution) (see D. H. Friedman, “Instantaneous-Frequency Distribution Vs. Time: An Interpretation of the Phase Structure of Speech”, IEEE Proc. ICASSP, pp 1121-1124, 1985). These models are derived from (nonplace/temporal) auditory nerve models of the human auditory nerve system.
In addition, the wavelet transform (WT) is a widely used time-frequency tool for signal processing which has proved to be well adapted for extracting the modulation laws of isolated or substantially distinct primary components. The WT performed with a complex analysis wavelet is known to carry relevant information in its modulus as well as in its phase. The information contained in the modulus is similar to the power spectrum derived parameters. The phase is partially independent of the amplitude level of the input signal. Practical considerations and intrinsic limitations, however, limit the direct application of the WT for speech recognition purposes.
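The behaviour of the modulus and phase can be illustrated with a sketch of a single wavelet coefficient computed with a complex analysis wavelet. The Morlet wavelet (without its small admissibility correction), the centre frequency parameter `omega0 = 6.0`, the truncation of the Gaussian envelope, and the function names are all assumptions chosen for the example and are not specified in the text above.

```python
import cmath
import math


def morlet(u, omega0=6.0):
    """Complex Morlet wavelet, omitting the small admissibility correction."""
    return cmath.exp(1j * omega0 * u) * math.exp(-0.5 * u * u)


def cwt_coefficient(samples, sample_rate_hz, center_time_s, freq_hz, omega0=6.0):
    """Wavelet coefficient at one time and one analysis frequency.

    The scale is chosen so that the wavelet's centre frequency equals
    freq_hz; the integral is approximated by a sum over the samples.
    """
    scale = omega0 / (2.0 * math.pi * freq_hz)
    total = 0j
    for n, x in enumerate(samples):
        u = (n / sample_rate_hz - center_time_s) / scale
        if abs(u) < 5.0:  # the Gaussian envelope is negligible beyond this
            total += x * morlet(u, omega0).conjugate()
    return total / (sample_rate_hz * math.sqrt(scale))


# A 200 Hz tone, the same tone ten times louder, and an off-frequency probe.
fs = 8000.0
tone = [math.cos(2.0 * math.pi * 200.0 * n / fs) for n in range(800)]
loud = [10.0 * x for x in tone]
c_tone = cwt_coefficient(tone, fs, 0.05, 200.0)
c_loud = cwt_coefficient(loud, fs, 0.05, 200.0)
c_off = cwt_coefficient(tone, fs, 0.05, 600.0)
```

In this idealized case, raising the input level multiplies the modulus `abs(c_loud)` by the same factor while leaving the phase `cmath.phase(c_loud)` unchanged, and the modulus at 600 Hz is far smaller than at the tone's own frequency, much as a power spectrum would show.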
Parallels between properties of the wavelet transform of primary components and algorithmic representations of speech signals derived from auditory nerve models like the EIH have led to the introduction of “synchrosqueezing” measures: a novel transformation of the time-scale plane obtained by a quasi-continuous wavelet transform into a time-frequency plane (i.e., synchrosqueezed plane) (see, e.g., “Robust Speech and Speaker Recognition Using Instantaneous Frequencies and Amplitudes Obtained With Wavelet-Derived Synchrosqueezing Measures”, Program on Spline Functions and the Theory of Wavelets, Montreal, Canada, March 1996, Centre de Recherches Mathématiques, Université de Montréal (invited paper)). On the other hand, as stated above, in automatic speech recognition, cepstral features have imposed themselves quasi-universally as the acoustic characteristics of speech utterances. The cepstra can be seen as explicit functions of the formants and other primary components of the modulation model. Two main classes of cepstrum extraction have been intensively used: LPC-derived cepstrum and FFT cepstrum. The second approach has become dominant, usually with Mel-binning. Accordingly, a method for extracting spectral features which utilizes these conventiona
