Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition
Reexamination Certificate
1998-08-14
2001-04-24
Hudspeth, David (Department: 2641)
C704S256000
active
06223155
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates generally to speech recognition systems, and more particularly to an improved method of developing and employing a garbage model in a speaker-dependent speech recognition system having limited resources such as a cellular telephone.
2. Description of Related Art
The user interfaces of many electronic systems now involve speech recognition technology. There are two general types of speech recognition systems: (1) “speaker independent” (SI) systems; and (2) “speaker dependent” (SD) systems. Some phone companies, for example, have used SI speech recognition technology to create directory assistance mechanisms whereby any user may say the name of the city for which directory assistance is desired. Likewise, some cellular telephones feature SD speech recognition technology so that a particular user may “train” the phone to recognize “call home” and then automatically dial the appropriate number.
Unlike SI systems, SD systems require training. SD systems, however, are normally hampered by having only limited training data because users would find it annoying to provide extensive training samples. Moreover, SD systems are often used in portable devices, such as cellular phones, which tend to have severely limited memory and/or computing power because they must be designed within certain size, memory, cost and power constraints. Solutions suitable for implementation in an SI system, therefore, are not generally applicable to an SD system having limited training data, particularly where such an SD system is used in a portable device, such as a cellular phone, having limited resources.
All speech recognition systems generally attempt to match an incoming “utterance” with one of a plurality of predetermined “vocabulary” words. In a typical implementation, the acoustic utterance is converted to a digital token, analyzed or decomposed in terms of characteristic “features,” and then simultaneously compared, feature-by-feature, with one or more word models that each represent a vocabulary word.
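The feature-matching process described above can be sketched as follows. This is a minimal illustration, not the patent's method: the three-element "feature" vectors, the vocabulary words, and the distance-based scoring function are all hypothetical stand-ins for a real system's acoustic features and trained word models.

```python
def score(features, model):
    """Hypothetical closeness measure: negative squared Euclidean
    distance between the utterance's feature vector and a word
    model's template (higher means a closer match)."""
    return -sum((f - m) ** 2 for f, m in zip(features, model))

def recognize(features, word_models):
    """Compare the utterance's features against every vocabulary
    word model and return the best-matching word."""
    return max(word_models, key=lambda w: score(features, word_models[w]))

# Toy 3-dimensional "feature" vectors standing in for real acoustic features.
models = {"call_home": [0.9, 0.1, 0.4], "redial": [0.2, 0.8, 0.5]}
utterance = [0.85, 0.15, 0.45]       # close to the "call_home" template
print(recognize(utterance, models))  # -> call_home
```

Note that this network always emits *some* vocabulary word, which is exactly why the mismatch and false-acceptance problems discussed next arise.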
FIG. 1, for example, shows a simplified network 20 that assigns an input utterance to one of N predetermined vocabulary words WORD_1 to WORD_N by finding the best match between certain “features” 200 of the input utterance and one of a plurality of “word models” 20-1 to 20-N. The FIG. 1 system, however, is subject to “mismatches” and “false acceptances”:
Mismatch: an utterance corresponding to one vocabulary word mistakenly matched with another vocabulary word.
False Acceptance: an utterance corresponding to a non-vocabulary word matched with a vocabulary word; or a non-vocabulary sound, such as a lip smack or a cough, matched with a vocabulary word.
Most speech recognition systems use some sort of “rejection” scheme to reject certain utterances and sounds that are likely to result in a mismatch or a false acceptance. Rejection of mismatches is desirable because it allows the system to gracefully prompt the user for more spoken input. Rejecting out-of-vocabulary words and non-speech sounds is always desirable because it reduces the rate of false acceptances. Rejection, however, also creates a byproduct called “false rejection”:
False Rejection: a rejection of an utterance corresponding to a vocabulary word.
A false rejection is a double-edged sword: whether it helps or hurts depends on what would have occurred in its absence. On the one hand, a false rejection improves recognition accuracy if the vocabulary word would have been mistakenly matched with another vocabulary word anyway (a “putative error”). On the other hand, a false rejection degrades performance and annoys the user if the vocabulary word would have been correctly matched in the absence of rejection.
The rejection system should, therefore, maximize the rejection of both out-of-vocabulary words and non-vocabulary sounds, while rejecting only those in-vocabulary words that are putative errors.
The most common rejection models applied in speech recognition systems are as follows:
1. Parallel Garbage Models
The first and most common approach to rejecting out-of-vocabulary words and sounds is to include an explicit, parallel “garbage” model that represents all such words and sounds. An SI system necessarily uses “generic” garbage models that were developed with a plurality of different speakers. In some cases, a single garbage model is derived from many samples of out-of-vocabulary words, non-speech sounds such as clicks and pops, and samples of background noise/silence signals. Rejection or acceptance of a spoken utterance is determined by measuring the closeness of the utterance to the garbage model.
In other cases, one or more garbage models are used to represent different varieties of non-vocabulary words and sounds. For example, one garbage model may represent the background noise/silence; another may represent coughs, lip smacks, clicks & pops; and yet another may represent out-of-vocabulary words/phrases. The decision process may also vary from system to system. The decision of rejecting a spoken utterance or accepting it as one of the vocabulary words, for example, may be made by comparing the vocabulary model scores to each of the garbage model scores (or to the average score of all the garbage models).
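One of the decision rules just described — comparing the best vocabulary score against the average of several category-specific garbage-model scores — can be sketched as follows. The scores and garbage categories here are made up for illustration; in a real system each score would come from a trained acoustic model.

```python
def decide(word_scores, garbage_scores):
    """Accept the best-scoring vocabulary word only if it beats the
    average of the garbage-model scores; otherwise reject the
    utterance as a non-vocabulary word or sound. This is one
    plausible decision rule among those described above."""
    best_word = max(word_scores, key=word_scores.get)
    avg_garbage = sum(garbage_scores.values()) / len(garbage_scores)
    return best_word if word_scores[best_word] > avg_garbage else None

# Hypothetical log-likelihood-style scores (higher = closer match).
garbage = {"noise/silence": -1.5, "coughs/clicks": -1.9, "oov_words": -1.2}
print(decide({"call_home": -0.8, "redial": -2.1}, garbage))  # -> call_home
print(decide({"call_home": -2.0, "redial": -2.1}, garbage))  # -> None (rejected)
```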
FIG. 2 shows a simplified network 20′ that is similar to FIG. 1, but which includes a parallel garbage model network 30 of “K” garbage models 30-1 to 30-K. Each garbage model 30-1 to 30-K operates in the same basic way as a vocabulary model 20-1 to 20-N, but the utterances that match the garbage models 30-1 to 30-K correspond to those words and sounds that are to be rejected. The user of any given system, of course, is generally expected to limit his utterances to in-vocabulary words. Accordingly, for any given utterance, it is more likely that the user's utterance is an in-vocabulary word than an out-of-vocabulary word. To reduce false rejections, therefore, the average score of the garbage models 30-1 to 30-K is often subjected to a “penalty,” as shown in FIG. 2, before being compared with each of the scores from the word models 20-1 to 20-N to determine the selected word 210.
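The effect of the penalty can be sketched as follows, again with made-up scores. Biasing the comparison toward the word models reflects the assumption that the user usually speaks in-vocabulary words; the penalty value here is purely illustrative.

```python
def decide_with_penalty(word_scores, garbage_scores, penalty=0.5):
    """FIG. 2-style decision: penalize the average garbage score
    before comparing it with the best word-model score, which
    reduces false rejections of in-vocabulary words."""
    penalized = sum(garbage_scores) / len(garbage_scores) - penalty
    best = max(word_scores, key=word_scores.get)
    return best if word_scores[best] > penalized else None

# A borderline utterance: its best word score equals the raw garbage
# average, so without the penalty it would be (falsely) rejected.
words = {"call_home": -1.6, "redial": -2.4}
garbage = [-1.5, -1.7]                                 # raw average: -1.6
print(decide_with_penalty(words, garbage))             # -> call_home (accepted)
print(decide_with_penalty(words, garbage, penalty=0))  # -> None (rejected)
```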
2. Absolute Threshold Model
FIG. 3 relates to another rejection approach that is used in both SI and SD systems. It is known as the “absolute threshold model.” The threshold approach to rejection does not use a parallel network of garbage models, but rather relies on a threshold 302 developed with advance knowledge of the system's score distribution 300 for out-of-vocabulary words/sounds on the one hand and the system's score distribution 304 for in-vocabulary words on the other.
FIG. 3, in particular, shows two smoothed histograms or histogram envelopes 300, 304 related to a hypothetical SI speech recognition system. The leftmost envelope 300 shows the distribution of tokens versus word score for words or sounds that are not part of the vocabulary, i.e. words or sounds that are garbage. The rightmost envelope 304 shows the distribution of tokens versus word score for in-vocabulary words. The shape of the envelopes 300, 304 may vary because of random variations in inflection, background noise, and so on. The user, in other words, may speak a vocabulary word and receive a score that is higher or lower than the average peak score. In addition, the system may similarly react to garbage with a range of scores.
Notwithstanding the width or spread of the histogram envelopes 300, 304, an over-threshold word score may reliably indicate that the token is an in-vocabulary word provided, of course, that the envelopes 300, 304 have little or no overlap (the ideal case). The system simply deems tokens with a word score above the threshold 302 as part of the vocabulary and deems tokens with a word score below the threshold 302 as garbage. So long as the histogram envelopes 300, 304 do not overlap too much, the threshold 302 reliably separates in-vocabulary words from garbage.
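The threshold decision itself is simple to sketch. The histogram peak values and the midpoint placement of the threshold below are hypothetical; a real system would derive the threshold from measured score distributions like envelopes 300 and 304.

```python
def accept(word_score, threshold):
    """Absolute-threshold rejection: a token scoring above the
    threshold is deemed in-vocabulary; below it, garbage."""
    return word_score > threshold

# Hypothetical peaks of the two score histograms: envelope 300
# (garbage) and envelope 304 (in-vocabulary words). Placing the
# threshold between them is reliable only when the envelopes
# barely overlap.
garbage_peak, vocab_peak = 0.35, 0.80
threshold = (garbage_peak + vocab_peak) / 2  # 0.575

print(accept(0.90, threshold))  # typical vocabulary score -> True
print(accept(0.20, threshold))  # typical garbage score -> False
```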
Akin Gump Strauss Hauer & Feld L.L.P.
Conexant Systems Inc.
Storm Donald L.