Enhanced likelihood computation using regression in a speech...

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate


Details

Status: active

Patent number: 06493667


FIELD OF THE INVENTION
The present invention relates generally to speech recognition systems and, more particularly, to methods and apparatus for performing enhanced likelihood computation using regression in speech recognition systems.
BACKGROUND OF THE INVENTION
It is known that a continuous speech recognition system, such as the IBM continuous speech recognition system, uses a set of phonetic baseforms and context-dependent models. These models are built by constructing decision tree networks that query the phonetic context to arrive at the appropriate models for the given context. A decision tree is constructed for every arc (a sub-phonetic unit that corresponds to a state of the three-state Hidden Markov Model, or HMM). Each terminal node (leaf) of the tree represents a set of phonetic contexts, such that the feature vectors observed in these contexts were close together, as defined by how well they fit a diagonal gaussian model. The feature vectors at each terminal node are modeled using a gaussian mixture density, with each gaussian having a diagonal covariance matrix. The IBM system also uses a rank-based decoding scheme, as described in Bahl et al., “Robust methods for using context-dependent features and models in a continuous speech recognizer,” ICASSP 1994, Vol. 1, pp. 533-536, the disclosure of which is incorporated herein by reference. The rank r(l, t) of a leaf l at time t is the rank order of the likelihood, given the mixture model of this leaf, in the sorted list of likelihoods computed using all the models of all the leaves in the system and sorted in descending order. In a rank-based system, the output distributions on the state transitions of the model are expressed in terms of the rank of the leaf. Each transition with arc label a has a probability distribution on ranks, which typically has a peak at rank one and rapidly falls off to low probabilities for higher ranks. The probability of rank r(l, t) for arc a is then used as the probability of generating the feature vector at time t on the transition with arc a.
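For concreteness, the rank computation described above can be sketched as follows. This is a simplified illustration, not code from the patent, and ranks here are 0-based, so the text's "rank one" corresponds to rank 0:

```python
import numpy as np

def leaf_ranks(log_likelihoods):
    """Given one log-likelihood per leaf at time t, return each leaf's rank
    (0 = highest likelihood) in the descending-order sorted list."""
    order = np.argsort(-log_likelihoods)   # leaf indices, best leaf first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(log_likelihoods))
    return ranks

# Hypothetical log-likelihood scores for 5 leaves at one frame
scores = np.array([-12.3, -9.8, -15.1, -9.9, -20.4])
ranks = leaf_ranks(scores)
# leaf 1 has the highest likelihood, so its rank is 0.
# A rank distribution for arc a, peaked at rank 0, would then supply
# the emission probability for the transition labelled a.
```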
The more often the correct leaf appears in the top rank positions, the better the recognition accuracy. In order to improve the rank of the correct leaf, its likelihood score has to be boosted relative to those of the other leaves; that is, the likelihood score of the correct leaf will be increased while those of the incorrect leaves will be decreased. A scheme to increase the likelihood of the correct leaf, capturing the correlation between adjacent vectors using correlation models, was introduced in P. F. Brown, “The Acoustic-Modeling Problem in Automatic Speech Recognition,” Ph.D. thesis, IBM RC 12750, 1987.
The approach in the P. F. Brown thesis was to do away with the assumption that, given the output distribution at time t, the acoustic observation at time t is independent of that at time t−1, or depends only on the transition taken at time t (P(y_t|s_t)), where y_t refers to the cepstral vector corresponding to the speech at time t, s_t refers to the transition at time t, and P(y_t|s_t) refers to the likelihood (i.e., probability) of generating y_t on the transition at time t, as understood by those skilled in the art. The manner in which y_{t−1} differs from the mean of the output distribution from which it is generated influences the way that y_t differs from the mean of the output distribution from which it is generated, where y_{t−1} refers to the cepstral vector corresponding to the speech at time t−1. This is achieved by conditioning the probability of generating y_t on the transition at time t, the transition at time t−1 (i.e., s_{t−1}), and y_{t−1}, that is:

P(y_t | s_t, s_{t−1}, y_{t−1})  (1)
Incorporating this into an HMM would in effect square the number of output distributions and also increase the number of parameters in each output distribution. When the training data is not sufficient, the benefit of introducing the correlation concept may not be seen. Alternatively, the probability could be conditioned only on the transition taken at time t and the output y_{t−1}, that is:

P(y_t | s_t, y_{t−1})  (2)
The output distribution for equation (2) has the form:

P(y_t | s_t, y_{t−1}) = (det W)^{1/2} (1/(2π)^{n/2}) exp[−½ Z′WZ]  (3)

where W refers to the covariance term, and Z is given by:

Z = y_t − (μ_t + C(y_{t−1} − μ_{t−1}))  (4)

where μ_t and μ_{t−1} refer to the means at times t and t−1, respectively, as is known in the art, and C is the regression matrix given by:

C = Σ(y_t · y_{t−1}) / |y_t|²  (5)

This form only increases the number of parameters in each output distribution, and not the number of output distributions, making it computationally attractive. However, from a modeling perspective, it is less accurate than equation (1) because the distribution from which y_{t−1} was generated, and its deviation from its mean, are unknown. There is an important trade-off between the complexity of an acoustic model and the quality of the parameters in that model: the greater the number of parameters in a model, the more variance there will be in the estimates of the probabilities of these acoustic events.
It would be highly desirable to provide techniques for use in speech recognition systems for enhancing the likelihood computation while minimizing or preventing an increase in the complexity of the HMMs.
SUMMARY OF THE INVENTION
The present invention provides methods and apparatus for improving recognition performance in a speech recognition system by improving the likelihood computation through the use of regression. That is, a methodology is provided that increases the likelihood of the correct leaf by capturing the correlation between adjacent vectors using correlation models. According to the invention, regression techniques are used to capture such correlation. The regression predicts the neighboring frames of the current frame of speech. The prediction-error likelihoods are then incorporated, or smoothed, into the overall likelihood computation to improve the rank position of the correct leaf without increasing the complexity of the HMMs.
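One plausible reading of the smoothing step is a weighted log-domain combination of the current-frame likelihood with the forward and backward prediction-error likelihoods. The interpolation weights below are hypothetical; no particular values are specified here:

```python
import numpy as np

def smoothed_log_likelihood(ll_current, ll_fwd_err, ll_bwd_err,
                            w=(0.6, 0.2, 0.2)):
    """Log-domain smoothing of the current-frame leaf likelihoods with the
    forward/backward prediction-error likelihoods. The weights w are
    hypothetical illustration values."""
    w0, w1, w2 = w
    return w0 * ll_current + w1 * ll_fwd_err + w2 * ll_bwd_err

# Per-leaf log-likelihoods for one frame (illustrative numbers, two leaves)
ll_cur = np.array([-10.0, -12.0])
ll_f   = np.array([-3.0,  -9.0])   # forward prediction-error likelihoods
ll_b   = np.array([-4.0,  -8.0])   # backward prediction-error likelihoods
combined = smoothed_log_likelihood(ll_cur, ll_f, ll_b)
# leaf 0's low prediction errors improve its combined score relative to leaf 1,
# which can raise its rank position in the sorted likelihood list
```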
In an illustrative embodiment of the invention, a method for use with a speech recognition system in processing a plurality of frames of a speech signal includes tagging the feature vectors associated with each frame received in a training phase with the best-aligning gaussian distributions. Then, forward and backward regression coefficients are estimated for the gaussian distributions for each frame. The method further includes computing residual error vectors from the regression coefficients for each frame and then modeling the prediction errors to form a set of gaussian models for the speech associated with each frame. The set of gaussian models is then used to calculate three sets of likelihood values for each frame of a speech signal received during a recognition phase.
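The training-phase steps can be sketched as follows. This is a simplified illustration using per-tag scalar regression coefficients in the spirit of equation (5), rather than full regression matrices:

```python
import numpy as np

def train_error_models(frames, tags):
    """Simplified sketch of the training steps: per-tag forward/backward
    regression coefficients, residual error vectors, and a diagonal-gaussian
    model (mean, variance) of those residuals.
    frames: (T, d) array of feature vectors.
    tags:   length-T list of best-aligning gaussian indices from alignment."""
    models = {}
    for g in set(tags):
        # interior frames tagged with gaussian g (need both neighbors)
        idx = [t for t in range(1, len(frames) - 1) if tags[t] == g]
        if not idx:
            continue
        cur = frames[idx]
        nxt = frames[[t + 1 for t in idx]]
        prv = frames[[t - 1 for t in idx]]
        # scalar regression coefficients in the spirit of equation (5)
        c_fwd = np.sum(nxt * cur) / np.sum(cur * cur)
        c_bwd = np.sum(prv * cur) / np.sum(cur * cur)
        # residual (prediction-error) vectors
        e_fwd = nxt - c_fwd * cur
        e_bwd = prv - c_bwd * cur
        # diagonal-gaussian models of the residuals (variance floored)
        models[g] = {
            "c_fwd": c_fwd, "c_bwd": c_bwd,
            "fwd": (e_fwd.mean(0), e_fwd.var(0) + 1e-6),
            "bwd": (e_bwd.mean(0), e_bwd.var(0) + 1e-6),
        }
    return models

# Toy usage: with y_{t+1} = 2 * y_t exactly, the forward coefficient is 2
# and the forward residuals vanish.
frames = np.array([[1.0], [2.0], [4.0], [8.0]])
tags = [0, 0, 0, 0]
models = train_error_models(frames, tags)
```

In recognition, the stored residual-error gaussians would supply the forward and backward prediction-error likelihoods that are smoothed into the overall score.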
Advantageously, in order to achieve low error rates in a speech recognition system, for example, in a system employing rank-based decoding, we discriminate the most confusable incorrect leaves from the correct leaf by lowering their ranks. That is, we increase the likelihood of the correct leaf of a frame, while decreasing the likelihoods of the confusable leaves. In order to do this, we use the auxiliary information from the prediction of the neighboring frames to augment the likelihood computation of the current frame. We then use the residual errors in the predictions of neighboring frames to discriminate between the correct (best) and incorrect leaves of a given frame. We present a new methodology that incorporates prediction error likelihoods into the overall likelihood computation to improve the rank position of the correct leaf.
These and other objects, features and advantages of the present invention will become apparent from the following
