Reexamination Certificate
2000-03-09
2003-03-18
Banks-Harold, Marsha D. (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S243000, C704S254000
active
06535850
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to speech recognition and training systems. More specifically, this invention relates to Speaker-Dependent (SD) speech recognition and training systems which include means for identifying confusingly similar words during training and means for increasing discrimination between such confusingly similar words on recognition.
2. Related Art
An SD system offers flexibility to the user by permitting the introduction of new words into the vocabulary. It also allows vocabulary words from different languages to be included. However, the advantages of a user-defined vocabulary and language independence can cause performance degradation if not implemented properly. Allowing a user-defined vocabulary introduces problems due to the flexibility in selecting the vocabulary words. One of the major problems encountered in allowing a user-defined vocabulary is the acoustical similarity of vocabulary words. For example, if “Rob” and “Bob” were selected as vocabulary words, the reliability of the recognition system would decrease.
When the user is given the freedom to choose any vocabulary words, the tendency is to select short words, which are convenient to train but produce unreliable models. Because the training data is limited (one token), the longer the word is, the more reliable the model will be. Finally, when the user enters a multiple-word phrase for a vocabulary item, the variation in the length of silence or pause between the words is critical to the success of the recognition system. In unsupervised training, there is no feedback from the system to the user during the training phase. Hence, the models created from such training do not avoid the above-identified problems.
To alleviate these problems, a smart/supervised training system needs to be introduced into an SD recognition system, particularly if it uses word-based models.
Many methods of SD speech training are present in the related art. For example, U.S. Pat. No. 5,452,397 to Ittycheriah, et al., incorporated herein by reference, assumes multiple-token training and describes a method of preventing the entry of confusingly similar phrases in a vocabulary list of a speaker-dependent voice recognition system. The first token of the word/phrase to be added to the vocabulary list is used to build a model for that word/phrase. Then, the second token (a repetition of the same word/phrase) is compared with the new model added to the vocabulary and also with the previously existing models in the vocabulary list. The scores of the existing models are weighted slightly higher than that of the new model. If the second token compares more closely with an existing model than with the new model, the new word/phrase is declared to be confusingly similar to one of the existing vocabulary items and the new model is removed. The user is then asked to select another word/phrase for training. Since this method requires multiple tokens, it is not suitable for an SD system that requires only a single token for training.
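As a concrete illustration, the two-token check described above reduces to a comparison of weighted match scores. The following is only a minimal sketch of that idea; the function name, the weighting factor, and the assumption that higher scores mean closer matches are ours, not taken from the Ittycheriah patent.

def rejects_new_phrase(new_model_score, existing_model_scores, weight=1.05):
    """Two-token confusability check (illustrative only).

    new_model_score:       score of the second token against the model built
                           from the first token (higher = closer match).
    existing_model_scores: scores of the second token against every model
                           already in the vocabulary.
    weight:                hypothetical factor favoring existing models slightly.
    """
    best_existing = max((weight * s for s in existing_model_scores),
                        default=float("-inf"))
    # If the repetition matches an existing item better than its own new model,
    # the new word/phrase is treated as confusingly similar and removed.
    return best_existing > new_model_score

# Example: the repetition of "Rob" scores 0.80 against its own model but 0.82
# against the existing "Bob" model, so the new entry would be rejected.
print(rejects_new_phrase(0.80, [0.82, 0.41]))  # True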
U.S. Pat. No. 5,754,977 to Gardner, et al., incorporated herein by reference, uses a distance value to measure the closeness of the word/phrase to be added to any of the existing vocabulary items. All the vocabulary items are sorted in order of closeness to the new pattern/model. Then, a Euclidean distance value is computed between the new model and the top entry in the sorted list. If the distance falls below a certain predetermined threshold, the user is warned about the acoustic similarity of the word/phrase to be added to one of the existing vocabulary items and is requested to make another entry. Although this approach can be used in an SD system with one-token training, the method is not very reliable. Since the distribution of the distance values will change significantly from user to user, it is very difficult to determine a reliable threshold value. Even when there is an ability to adjust or change the threshold value from user to user, a priori information on the distance/score distribution, such as utterance magnitude, is still required to change the threshold to a meaningful value.
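A minimal sketch of this distance-threshold idea follows, assuming each model is a fixed-length feature vector; the function names and the threshold value are placeholders, not taken from the Gardner patent.

import math

def closest_distance(new_model, vocabulary_models):
    """Euclidean distance from the new model to its nearest existing item."""
    return min((math.dist(new_model, existing) for existing in vocabulary_models),
               default=float("inf"))

def warn_user(new_model, vocabulary_models, threshold=2.5):
    """Warn when the new item falls below the predetermined distance threshold.

    As the text notes, a single fixed threshold is unreliable because the
    distance distribution varies from speaker to speaker.
    """
    return closest_distance(new_model, vocabulary_models) < threshold

# Example with toy 3-dimensional "models":
print(warn_user([1.0, 2.0, 0.5], [[1.1, 2.1, 0.4], [5.0, 0.0, 3.0]]))  # True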
U.S. Pat. No. 5,664,058 to Vysotsky, incorporated herein by reference, is a speech recognizer training system using one or a few isolated words which are converted to a token. Vysotsky performs multiple tests to determine whether a training utterance is to be accepted or rejected, to prevent the user from adding a new voice message that is similar to a voice message the recognizer has previously been trained to recognize, and to ensure a consistent pronunciation for all templates corresponding to the same voice message. This approach also requires two or more training tokens to perform these tests. The tests use a distance measure as a criterion for determining the closeness of the new token to the previously stored templates. Even though this approach is more robust than the other two methods, it requires more tokens and more tests than the other methods described above. This technique also uses absolute thresholds, which may not necessarily be uniform across different speakers. Unlike most current SD systems, the matching in this approach is performed by Dynamic Time Warping (DTW), which is used to match utterances of a different length than the test speech pattern. Hence, the criteria used in this approach are not directly applicable to systems that use HMMs for modeling the speech.
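For reference, the DTW matching mentioned here is the standard dynamic-programming alignment of two frame sequences of possibly different lengths. The following is a textbook-style sketch, not code from the Vysotsky patent.

def dtw_distance(seq_a, seq_b, frame_dist=lambda x, y: abs(x - y)):
    """Dynamic Time Warping cost between two frame sequences (textbook version)."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed warping moves.
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# Sequences of different lengths can still be compared:
print(dtw_distance([1, 2, 3, 3, 4], [1, 2, 3, 4]))  # 0.0 (perfect warp)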
Most of the solutions proposed in the related art assume that more than one token is available during the training phase for building the models for the vocabulary words. The SD speech recognition system of the present invention requires only one token per vocabulary item for training, and since the models built from one-token training are not very robust, performance is improved significantly by identifying and indicating to the user the problem words during the training phase, i.e., smart training.
Also, some of the previous solutions rely on absolute score thresholds to determine the closeness of words. Unfortunately, the same threshold cannot be used for every user. Hence, the training cannot be completely unsupervised.
Finally, the previous solutions only avoid adding acoustically similar words to the vocabulary. None of the above systems presents a solution for resolving the entry of confusable words, that is, words which are acoustically similar. They also fail to address several other problems encountered in training.
The present invention describes a solution for each of the problems described above that cause various degradations in the performance of SD speech recognition systems, by using a confidence-measure-based smart training system which avoids or compensates for similar-sounding words in the vocabulary. Using duration information, the training process cautions the user about entries to the vocabulary that are likely to be sources of frequent errors. Finally, based on the output of smart training, a smart scoring procedure is included in the method described herein to improve recognition performance in the event the user chooses to include similar-sounding words in the vocabulary.
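By way of illustration only, a relative confidence measure can be made less sensitive to speaker-dependent score scales than an absolute threshold. The margin computation, the duration check, and all constants below are our assumptions, not the patent's actual measure.

def training_confidence(own_score, competitor_scores):
    """Relative margin between the new item's own model score and its best
    competitor, normalized so it depends less on the speaker's score scale."""
    best_competitor = max(competitor_scores, default=float("-inf"))
    return (own_score - best_competitor) / max(abs(own_score), 1e-9)

def flag_training_entry(own_score, competitor_scores, duration_frames,
                        min_confidence=0.05, min_duration_frames=30):
    """Caution the user about entries likely to cause frequent errors:
    either too close to an existing item or too short to yield a reliable model."""
    too_similar = training_confidence(own_score, competitor_scores) < min_confidence
    too_short = duration_frames < min_duration_frames
    return too_similar or too_short

# A short utterance whose own score barely beats a competitor would be flagged:
print(flag_training_entry(-100.0, [-101.0], duration_frames=18))  # True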
The invention improves the performance and reliability of the SD speech recognition system over the related-art systems by avoiding similar-sounding entries to the vocabulary during training, avoiding very short words and other utterances that are likely to cause recognition errors, and suggesting alternative solutions. In the event the user insists on including similar-sounding words in the vocabulary, the invention augments the recognition of such similar-sounding words by using a confidence measure, instead of absolute scores, to determine the acoustic similarity of the vocabulary items and by modifying the scoring algorithm during recognition. The present invention also uses additional information such as the duration of the utterance and the number of words in a vocabulary item. The smart training process described herein can be applied either to single-token training or to multiple-token training.
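The paragraph above does not spell out how the scoring algorithm is modified. Purely as a hypothetical sketch, one simple form of "smart scoring" is to treat a win over a known confusable competitor as reliable only when the score margin is large enough; the structure and the margin value below are illustrative assumptions.

def smart_decision(ranked_candidates, confusable_pairs, min_margin=0.02):
    """ranked_candidates: list of (word, score) sorted best-first.
    confusable_pairs:  set of frozensets of word pairs flagged during smart training.
    Returns the recognized word and whether the decision is considered reliable."""
    best_word, best_score = ranked_candidates[0]
    second_word, second_score = ranked_candidates[1]
    margin = abs(best_score - second_score) / max(abs(best_score), 1e-9)
    if frozenset((best_word, second_word)) in confusable_pairs and margin < min_margin:
        # Known confusable pair with a thin margin: report low reliability so the
        # application can re-prompt or apply an additional discrimination step.
        return best_word, False
    return best_word, True

pairs = {frozenset(("Rob", "Bob"))}
print(smart_decision([("Rob", -98.0), ("Bob", -99.0)], pairs))  # ('Rob', False)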
SUMMARY OF THE INVENTION
Conexant Systems Inc.
Lerner Martin
Seed IP Law Group PLLC