Reexamination Certificate
2000-04-05
2002-12-03
Chawan, Vijay (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
active
06490555
ABSTRACT:
FIELD OF THE INVENTION
The invention generally relates to automatic speech recognition, and more particularly, to a technique for adjusting the mixture components of hidden Markov models as used in automatic speech recognition.
BACKGROUND ART
The goal of automatic speech recognition (ASR) systems is to determine the lexical identity of spoken utterances. The recognition process, also referred to as classification, begins with the conversion of the acoustical signal into a stream of spectral vectors or frames that describe the important characteristics of the signal at specified times. Classification is attempted by first creating reference models that describe some aspect of the behavior of spectral frames corresponding to different words.
A wide variety of models have been developed, but they all share the property that they describe the temporal characteristics of spectra typical of particular words or sub-word segments. The sequence of spectra arising from an input utterance is compared to such models, and the success with which different models predict the behavior of the input frames determines the putative identity of the utterance.
Currently most systems utilize some variant of a statistical model called the hidden Markov model (HMM). Such models consist of sequences of states connected by arcs, and a probability density function (pdf) associated with each state which describes the likelihood of observing any given spectral vector at that state. A separate set of probabilities determines transitions between states.
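As an illustration only, and not the implementation described in this patent, the structure above can be sketched as a table of log transition probabilities plus one scoring function per state, with a Viterbi search that finds the best state sequence for a run of frames. The names `viterbi`, `log_init`, `log_trans`, and `log_emit`, the one-dimensional frames, and the unit-variance Gaussian pdfs are all assumptions made for this sketch.

```python
import math

NEG_INF = float("-inf")

def viterbi(frames, log_init, log_trans, log_emit):
    """Return (best_log_score, best_state_path) for a sequence of frames.

    log_init[s]     : log probability of starting in state s
    log_trans[r][s] : log probability of the arc r -> s (-inf if there is no arc)
    log_emit(s, x)  : log pdf value of frame x at state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit(s, frames[0]) for s in range(n_states)]
    backptrs = []
    for x in frames[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at this frame.
            prev = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            ptr.append(prev)
            new_score.append(score[prev] + log_trans[prev][s] + log_emit(s, x))
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path

# Toy two-state left-to-right model with hypothetical parameters.
means = [0.0, 5.0]

def log_emit(s, x):
    # Unit-variance Gaussian log density (an assumption of this sketch).
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - means[s]) ** 2

log_init = [0.0, NEG_INF]                      # must start in state 0
log_trans = [[math.log(0.5), math.log(0.5)],   # state 0: stay or advance
             [NEG_INF, 0.0]]                   # state 1: self-loop only
score, path = viterbi([0.1, -0.2, 4.8, 5.1], log_init, log_trans, log_emit)
print(path)
```

Working in the log domain avoids numeric underflow on long utterances, which is why the sketch adds log probabilities instead of multiplying probabilities.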
Various levels of modeling power are available in the case of the probability densities describing the observed spectra associated with the states of the HMM. There are two major approaches: the discrete pdf and the continuous pdf. With continuous pdfs, parametric functions specify the probability of any arbitrary input spectral vector given a state. The most common class of functions used for this purpose is a mixture of Gaussians, where arbitrary pdfs are modeled by a weighted sum of normal distributions. One drawback of using continuous pdfs is that the designer must make explicit assumptions about the nature of the pdf being modeled—something which can be quite difficult since the true distribution form for the speech signal is not known. In addition, continuous pdf models are computationally far more expensive than discrete pdf models.
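A mixture-of-Gaussians state pdf can be sketched as follows. This is a minimal example that assumes diagonal covariances (a common simplification, not something this text specifies); the function names `log_gaussian_diag` and `log_mixture_pdf` are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_mixture_pdf(x, weights, means, variances):
    """Log of sum_k w_k * N(x; mean_k, var_k), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)  # factor out the largest term for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical two-component mixture over 2-dimensional spectral vectors.
weights = [0.7, 0.3]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]
print(log_mixture_pdf([0.2, -0.1], weights, means, variances))
```

The cost noted above is visible here: every frame requires one Gaussian evaluation per mixture component per state, whereas a discrete pdf needs only a table lookup after vector quantization.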
The total number of pdfs in a recognition system depends on the number of distinct HMM states, which in turn is determined by the type of models used (e.g., phonetic or word models). In many systems the states from different models can be pooled; that is, the states from different models can share pdfs from a common set or pool. For example, some states from two different models that represent a given phone in different phonetic contexts (i.e., an allophone) may have similar pdfs. In some systems these pdfs will be combined into one, to be shared by both states. This may be done to save memory and in some instances to overcome a problem known as undertraining.
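One way to picture pooling is to have each state hold an index into a shared pdf pool instead of owning its own pdf. The sketch below is purely illustrative; the `State` class, the model names, and the pool contents are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class State:
    name: str
    pdf_id: int  # index into the shared pool, not a privately owned pdf

# Placeholder payloads; a real pool would hold mixture parameters.
pdf_pool = {0: "shared onset pdf", 1: "release pdf A", 2: "release pdf B"}

# Two allophone models of one phone in different contexts (hypothetical names).
t_after_s = [State("t(s_).1", 0), State("t(s_).2", 1)]
t_after_a = [State("t(a_).1", 0), State("t(a_).2", 2)]

all_states = t_after_s + t_after_a
distinct_states = len(all_states)
distinct_pdfs = len({s.pdf_id for s in all_states})
print(distinct_states, distinct_pdfs)
```

Because both first states point at pdf 0, training data aligned to either state contributes to the same pdf, which is how sharing can ease undertraining as well as save memory.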
The model pdfs, whether discrete or continuous, are most commonly trained using the maximum likelihood method. In this manner, the model parameters are adjusted so that the likelihood of observing the training data given the model is maximized. However, it is known that this approach does not necessarily lead to the best recognition performance. This realization has led to the development of new training criteria, known as discriminative, the objective of which is to adjust model parameters so as to minimize the number of recognition errors rather than fit the distributions to the data.
FIG. 1 shows a feature vector 10 representative of an input speech frame in a multidimensional vector space, a “correct” state S_C 11 from the model that corresponds to the input speech, and an “incorrect” state S_I 12 from a model that does not correspond to the input speech. As shown in FIG. 1, the vector space distance from the feature vector 10 to the best branch 13 (the closest mixture component) of correct state S_C 11 is very nearly the same as the vector space distance from the feature vector 10 to the best branch 14 of the incorrect state S_I 12. In this situation, there is very little basis at the state level for distinguishing the correct state S_C 11 from the incorrect state S_I 12.
Discriminative training attempts to adjust the best branch 13 of correct state S_C 11 a little closer to the vector space location of feature vector 10, and adjust the best branch 14 of the incorrect state S_I 12 a little farther from the vector space location of feature vector 10. Thus, a future feature vector near the vector space of feature vector 10 will be more likely to be identified with correct state S_C 11 than with incorrect state S_I 12. Of course discriminative training may adjust the vector space of the correct state with respect to multiple incorrect states. Similarly, rather than adjusting the best branches of the states, a set of mixture components within each state may be adjusted.
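The geometric idea can be sketched in a few lines. This is a simplified illustration and not the adjustment rule of any particular system: it assumes squared Euclidean distance, treats each branch as just its mean vector, and uses a fixed hypothetical step size `step` in place of a gradient derived from a real loss function.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def best_branch(x, means):
    """Index of the mixture component mean closest to feature vector x."""
    return min(range(len(means)), key=lambda k: sq_dist(x, means[k]))

def discriminative_step(x, correct_means, incorrect_means, step=0.1):
    """Move the correct state's best branch toward x and the incorrect one away.

    Returns new lists of means; the inputs are left unmodified.
    """
    c = best_branch(x, correct_means)
    i = best_branch(x, incorrect_means)
    new_correct = [list(m) for m in correct_means]
    new_incorrect = [list(m) for m in incorrect_means]
    new_correct[c] = [m + step * (xi - m) for xi, m in zip(x, correct_means[c])]
    new_incorrect[i] = [m - step * (xi - m) for xi, m in zip(x, incorrect_means[i])]
    return new_correct, new_incorrect

# Feature vector equidistant from the two best branches, as in FIG. 1.
x = [0.0, 0.0]
correct = [[1.0, 0.0], [5.0, 5.0]]      # hypothetical branch means of S_C
incorrect = [[-1.0, 0.0], [4.0, 4.0]]   # hypothetical branch means of S_I
new_correct, new_incorrect = discriminative_step(x, correct, incorrect)
print(sq_dist(x, new_correct[0]), sq_dist(x, new_incorrect[0]))
```

After one step the correct best branch is closer to the feature vector than the incorrect one, so the tie at the state level is broken in favor of the correct state. Adjusting a set of components instead of only the best branch would amount to looping over several indices in each state.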
While discriminative training shows considerable promise, so far it has been applied most successfully to small vocabulary and isolated word recognition tasks. In addition, discriminative training presents a number of new problems, such as how to appropriately smooth the discriminatively trained pdfs, and how to adapt these systems to a new user with a relatively small amount of training data.
U.S. Pat. No. 6,260,013 describes a system using discriminatively trained multi-resolution models in the context of an isolated word recognition system. However, the techniques described therein are not efficiently extensible to a continuous speech recognition system.
SUMMARY OF THE INVENTION
A representative embodiment of the present invention includes a method of a continuous speech recognition system for discriminatively training hidden Markov models for a system recognition vocabulary. An input word phrase is converted into a sequence of representative frames. A correct state sequence alignment with the sequence of representative frames is determined, the correct state sequence alignment corresponding to models of words in the input word phrase. A plurality of incorrect recognition hypotheses is determined representing words in the recognition vocabulary that do not correspond to the input word phrase, each hypothesis being a state sequence based on the word models in the acoustic model database. A correct segment of the correct word model state sequence alignment is selected for discriminative training. A frame segment of frames in the sequence of representative frames is determined that corresponds to the correct segment. An incorrect segment of a state sequence in an incorrect recognition hypothesis is selected, the incorrect segment corresponding to the frame segment. A discriminative adjustment is performed on selected states in the correct segment and the corresponding states in the incorrect segment.
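The segment selection steps can be sketched under some simplifying assumptions: each alignment is stored as one `(word_index, state_id)` pair per frame, the correct segment is taken to be one whole word, and only frames where the two hypotheses disagree are kept for adjustment. These representation choices and the function names are hypothetical and are not taken from the patent text.

```python
def frame_span_of_word(alignment, word_index):
    """Return (start, end) frame indices, end exclusive, of one word.

    alignment holds one (word_index, state_id) pair per frame.
    """
    frames = [t for t, (w, _) in enumerate(alignment) if w == word_index]
    return frames[0], frames[-1] + 1

def paired_states(correct_alignment, incorrect_alignment, word_index):
    """Pair correct and incorrect states over the frames of one correct word."""
    start, end = frame_span_of_word(correct_alignment, word_index)
    correct_segment = [s for _, s in correct_alignment[start:end]]
    incorrect_segment = [s for _, s in incorrect_alignment[start:end]]
    # Assumption of this sketch: frames where both hypotheses use the same
    # state offer nothing to discriminate, so they are skipped.
    return [
        (t, c, i)
        for t, (c, i) in enumerate(zip(correct_segment, incorrect_segment), start)
        if c != i
    ]

# Five frames; the correct phrase has two words, the wrong hypothesis three.
correct = [(0, "a1"), (0, "a2"), (1, "b1"), (1, "b1"), (1, "b2")]
incorrect = [(0, "a1"), (0, "a2"), (1, "c1"), (1, "c2"), (2, "d1")]
pairs = paired_states(correct, incorrect, word_index=1)
print(pairs)
```

In this toy case the frame segment of the second correct word covers all of one incorrect word and the first frame of another, so the incorrect segment need not line up with word boundaries in the incorrect hypothesis.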
In a further embodiment, performing a discriminative adjustment occurs in a batch training mode at the end of a user session with the speech recognition system, and the discriminative adjustment performed on the selected and corresponding states represents a sum of calculated adjustments over the session. Alternatively, performing a discriminative adjustment may occur in an on-line mode in which the selected and corresponding states are discriminatively adjusted for each input word phrase.
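The difference between the two modes can be sketched as a small accumulator. This is an illustrative outline only: the class `MeanUpdater`, the branch identifier, and the adjustment values are hypothetical, and the calculated adjustments are passed in as given instead of being derived from a training criterion.

```python
class MeanUpdater:
    """Applies adjustments to branch means at once (on-line) or at session end (batch)."""

    def __init__(self, means, online):
        self.means = {k: list(v) for k, v in means.items()}
        self.online = online
        self.pending = {}  # branch id -> summed adjustment (batch mode only)

    def add(self, branch_id, delta):
        """Record one calculated adjustment for a branch mean."""
        if self.online:
            self._apply(branch_id, delta)
        else:
            total = self.pending.setdefault(branch_id, [0.0] * len(delta))
            for d, value in enumerate(delta):
                total[d] += value

    def end_session(self):
        """Batch mode: apply the summed adjustments, then clear them."""
        for branch_id, total in self.pending.items():
            self._apply(branch_id, total)
        self.pending.clear()

    def _apply(self, branch_id, delta):
        mean = self.means[branch_id]
        for d, value in enumerate(delta):
            mean[d] += value

start = {"branch13": [1.0, 0.0]}

batch = MeanUpdater(start, online=False)
batch.add("branch13", [-0.1, 0.0])    # adjustment from the first phrase
batch.add("branch13", [-0.05, 0.2])   # adjustment from the second phrase
before_end = list(batch.means["branch13"])
batch.end_session()

online = MeanUpdater(start, online=True)
online.add("branch13", [-0.1, 0.0])   # takes effect immediately
print(before_end, batch.means["branch13"], online.means["branch13"])
```

In a real system the two modes would generally not end at the same parameters, because on-line adjustments change the models that later phrases are scored against, while batch adjustments are all computed against the models as they stood during the session.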
Performing a discriminative adjustment may include using a language model weighting of the selected and corresponding states, in which case, when the selected segment of an incorrect recognition hypothesis is a fractional portion of a word model state sequence, the language model weighting for the fractional portion corresponds to the fractional amount of the word model that the fractional portion represents. The discriminative adjustment may include performing a gradient adjustment to selected branches of a selected state in the correct hypothesis model and a corresponding state in the inco
Sarukkai Ramesh
Sejnoha Vladimir
Yegnanarayanan Girija
Bromberg & Sunstein LLP
Chawan Vijay
Opsasnick Michael N.
ScanSoft, Inc.