Reexamination Certificate
2000-04-27
2003-09-30
Abebe, Daniel (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S253000
active
06629073
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to speech recognition. In particular, the present invention relates to the use of models to perform speech recognition.
In speech recognition systems, an input speech signal is converted into words that represent the verbal content of the speech signal. This conversion begins by converting the analog speech signal into a series of digital values. The digital values are then passed through a feature extraction unit, which computes a sequence of feature vectors based on the digital values. Each feature vector is typically multi-dimensional and represents a single frame of the speech signal.
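As an illustration of the conversion described above, the following minimal sketch splits a series of digital values into frames and computes one feature vector per frame. The frame length, the hop size, and the two toy features (log energy and zero-crossing rate) are assumptions chosen for brevity; the text does not specify which features are extracted.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of digital samples into fixed-length, possibly overlapping frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def feature_vector(frame):
    """Toy 2-dimensional feature vector: [log energy, zero-crossing rate] (assumed features)."""
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small floor avoids log(0) on silent frames
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [log_energy, crossings / (len(frame) - 1)]

# Stand-in for the digitized speech signal: a sampled sine wave.
samples = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
frames = frame_signal(samples, frame_len=20, hop=10)
features = [feature_vector(f) for f in frames]  # one feature vector per frame
```

Each entry of `features` corresponds to a single frame, which matches the one-vector-per-frame arrangement the text describes.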
To identify a most likely sequence of words, the feature vectors are applied to one or more models that have been trained using a training text. Typically, this involves applying the feature vectors to a frame-based acoustic model in which a single frame state is associated with a single feature vector. Recently, however, segment models have been introduced that associate multiple feature vectors with a single segment state. The segment models are thought to provide a more accurate model of large-scale transitions in human speech.
All models, both frame-based and segment-based, determine a probability for an acoustic unit. In initial speech recognition systems, the acoustic unit was an entire word. However, such systems required a large amount of modeling data since each word in the language had to be modeled separately. For example, if a language contained 10,000 words, the recognition system needed 10,000 models.
To reduce the number of models needed, the art began using smaller acoustic units. Examples of such smaller units include phonemes, which represent individual sounds in words, and senones, which represent individual states within phonemes. Other recognition systems used diphones, which represent an acoustic unit spanning from the center of one phoneme to the center of a neighboring phoneme.
When determining the probability of a sequence of feature vectors, speech recognition systems of the prior art did not mix different types of acoustic units. Thus, when determining a probability using a phoneme acoustic model, all of the acoustic units under consideration would be phonemes. The prior art did not use phonemes for some segments of the speech signal and senones for other parts of the speech signal. Because of this, developers had to decide between using larger units that worked well with segment models or using smaller units that were easier to train and required less data.
During speech recognition, the probability of an individual acoustic unit is often determined using a set of Gaussian distributions. At a minimum, a single Gaussian distribution is provided for each feature vector spanned by the acoustic units.
The Gaussian distributions are formed from training data and indicate the probability of a feature vector having a specific value for a specific acoustic unit. The distributions are formed by measuring the values of the feature vectors that are generated by a trainer reciting from a training text. For example, for every occurrence of the phoneme “th” in the training text, the resulting values of the feature vectors are measured and used to generate the Gaussian distribution.
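The training step described above can be sketched as follows. This is a simplified, one-dimensional illustration: the measured values in `th_values` are made up, and the function names are hypothetical. A real system would fit a multi-dimensional distribution over full feature vectors.

```python
import math

def fit_gaussian(values):
    """Estimate the mean and variance of a 1-D Gaussian from measured training values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, 1e-6)  # variance floor keeps the density well defined

def log_density(x, mean, var):
    """Log of the Gaussian probability density at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Hypothetical feature values measured at each occurrence of one acoustic unit.
th_values = [1.0, 1.2, 0.8, 1.1, 0.9]
mean, var = fit_gaussian(th_values)

near = log_density(1.0, mean, var)  # value close to the training data
far = log_density(2.0, mean, var)   # value far from the training data
```

A value near the training measurements receives a higher log density than a value far from them, which is what lets the distribution score how well an observed feature vector fits the acoustic unit.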
Because different speakers produce different speech signals, a single Gaussian distribution for an acoustic unit can sometimes produce a high error rate in speech recognition simply because the observed feature vectors were produced by a different speaker than the speaker used to train the system. To overcome this, the prior art introduced a mixture of Gaussian distributions for each acoustic unit. Within each mixture, a separate Gaussian is generated for one group of speakers. For example, there could be one Gaussian for the male speakers and one Gaussian for the female speakers.
Using a mixture of Gaussians, each acoustic unit has multiple targets, one located at the mean of each Gaussian. Thus, for a particular acoustic unit, one target may be from a male training voice and another target may be from a female training voice.
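A minimal sketch of such a mixture, assuming one-dimensional features and two equally weighted components, is shown below. The weights, means, and variances are invented for illustration, and `best_component` is a hypothetical helper that reports which target best explains an observation.

```python
import math

def log_gauss(x, mean, var):
    """Log of the 1-D Gaussian density at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def mixture_log_likelihood(x, components):
    """Log-likelihood of x under a mixture given as (weight, mean, var) tuples."""
    logs = [math.log(w) + log_gauss(x, m, v) for w, m, v in components]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))

def best_component(x, components):
    """Index of the component (target) that contributes most to the likelihood of x."""
    logs = [math.log(w) + log_gauss(x, m, v) for w, m, v in components]
    return logs.index(max(logs))

# Assumed mixture for one acoustic unit: index 0 = male target, index 1 = female target.
mixture = [(0.5, 1.0, 0.04), (0.5, 2.0, 0.04)]
```

An observation near 1.0 is best explained by the first target and an observation near 2.0 by the second, while `mixture_log_likelihood` scores an observation against both targets at once.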
Since the probability associated with each acoustic unit is determined serially under the prior art, it is possible to use targets associated with two different groups of speakers when determining the probabilities of feature vectors for two neighboring acoustic units. Thus, in one acoustic unit, a target associated with a male trainer may be used to determine the probability of a set of feature vectors and in the next acoustic unit a target associated with a female speaker may be used to determine the probability of a set of feature vectors. Such a discontinuity in the targets between neighboring acoustic units is undesirable because it represents a trajectory in the speech signal that never occurs in the training data. Such a trajectory is known as a phantom trajectory in the art.
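The phantom trajectory problem can be illustrated with a small sketch. The targets and observations below are invented, and `consistent_choice` is only one hypothetical way to contrast serial scoring with a choice that keeps the same speaker group across units. It is not presented here as the method of the invention.

```python
import math

def log_gauss(x, mean, var):
    """Log of the 1-D Gaussian density at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Assumed targets for two neighboring acoustic units: group -> (mean, variance).
units = [
    {"male": (1.0, 0.04), "female": (2.0, 0.04)},
    {"male": (1.5, 0.04), "female": (2.5, 0.04)},
]
observations = [1.2, 2.1]  # one observed value per unit (made up)

def serial_choice(units, observations):
    """Pick the best target for each unit independently, as in serial scoring."""
    return [max(u, key=lambda g: log_gauss(x, *u[g]))
            for u, x in zip(units, observations)]

def consistent_choice(units, observations):
    """Pick the single speaker group that best explains all units together."""
    def total(group):
        return sum(log_gauss(x, *u[group]) for u, x in zip(units, observations))
    best = max(units[0].keys(), key=total)
    return [best] * len(units)

serial = serial_choice(units, observations)
consistent = consistent_choice(units, observations)
```

With these numbers, serial scoring selects the male target for the first unit and the female target for the second, a combination that no single training speaker produced, whereas the consistent choice stays with one group throughout.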
SUMMARY OF THE INVENTION
A speech recognition method and system utilize an acoustic model that is capable of providing probabilities for both a large acoustic unit and an acoustic sub-unit. Each of these probabilities describes the likelihood of a set of feature vectors from a series of feature vectors representing a speech signal. The large acoustic unit is formed from a plurality of acoustic sub-units. At least one sub-unit probability and at least one large unit probability from the acoustic model are used by a decoder to generate a score for a sequence of hypothesized words. When combined, the acoustic sub-units associated with all of the sub-unit probabilities used to determine the score span fewer than all of the feature vectors in the series of feature vectors.
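One way to picture the mixed-unit scoring summarized above is the following sketch. The span boundaries, the log probabilities, and the simple additive combination are all assumptions made for illustration; the summary does not state how the decoder combines the probabilities.

```python
# Each entry pairs a span of feature-vector indices (start, end exclusive)
# with a log probability from a hypothetical acoustic model.
num_frames = 6
large_unit_logprobs = [((0, 4), -3.2)]                 # one large acoustic unit
sub_unit_logprobs = [((4, 5), -1.1), ((5, 6), -0.9)]   # two acoustic sub-units

def hypothesis_score(large_units, sub_units):
    """Combine large-unit and sub-unit log probabilities by summation (assumed rule)."""
    return sum(lp for _, lp in large_units) + sum(lp for _, lp in sub_units)

def frames_covered(spans):
    """Set of feature-vector indices spanned by the given units."""
    covered = set()
    for (start, end), _ in spans:
        covered.update(range(start, end))
    return covered

score = hypothesis_score(large_unit_logprobs, sub_unit_logprobs)
sub_unit_frames = frames_covered(sub_unit_logprobs)
```

Here the sub-units cover only frames 4 and 5, so together they span fewer than all six feature vectors, and the remaining frames are scored by the large unit.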
In some embodiments of the invention, an overlapping decoding technique is used. In this decoding system, two acoustic probabilities are determined for two sets of feature vectors wherein the two sets of feature vectors are different from each other but include at least one common feature vector. A most likely sequence of hypothesized words is then identified using the two acoustic probabilities.
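The notion of two different sets of feature vectors that share at least one common vector can be sketched as below. The set size and the amount of overlap are assumed values, integers stand in for feature vectors, and the sketch only forms the sets. It does not show how the decoder uses the resulting acoustic probabilities.

```python
def overlapping_sets(features, size, overlap):
    """Group a feature sequence into sets of `size` items sharing `overlap` items with the next set."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    return [features[i:i + size]
            for i in range(0, len(features) - size + 1, step)]

# Integers stand in for seven consecutive feature vectors.
features = list(range(7))
sets = overlapping_sets(features, size=4, overlap=1)
common = set(sets[0]) & set(sets[1])  # feature vectors present in both sets
```

With seven feature vectors, a set size of four, and an overlap of one, the two sets differ from each other but both contain the vector at index 3.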
REFERENCES:
patent: 4914703 (1990-04-01), Gillick
patent: 5133012 (1992-07-01), Nitta
patent: 5369726 (1994-11-01), Kroeker et al.
patent: 5572624 (1996-11-01), Sejnoha
patent: 5617509 (1997-04-01), Kushner et al.
patent: 5625749 (1997-04-01), Goldenthal et al.
patent: 5787396 (1998-07-01), Komori et al.
patent: 5937384 (1999-08-01), Huang et al.
patent: 6055498 (2000-04-01), Neumeyer et al.
patent: 6092045 (2000-07-01), Stubley et al.
patent: 6185528 (2001-02-01), Fissore et al.
“Probabilistic-trajectory segmental HMMs”, Computer Speech and Language, by Wendy J. Holmes et al., Article No. csla. 1998.0048, pp. 3-37 (1999).
“Parametric Trajectory Mixtures for LVCSR”, by Man-hung Siu et al., ICSLP-1998, 4 pages.
“Speech Recognition Using Hidden Markov Models with Polynomial Regression Functions as Nonstationary States”, by Li Deng et al., IEEE Transactions on Speech and Audio Processing, vol. 2, No. 4, pp. 507-520 (Oct. 1994).
“From HMM's to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition”, by Mari Ostendorf et al., IEEE Transactions on Speech and Audio Processing, vol. 4, No. 5, pp. 360-379 (Sep. 1996).
U.S. patent application Ser. No. 09/560,506, Ho et al., filed Apr. 27, 2000.
Hon Hsiao-Wuen
Wang Kuansan
Abebe Daniel
Magee Theodore M.
Microsoft Corporation
Westman Champlin & Kelly P.A.