Reexamination Certificate
1999-08-09
2003-02-25
McFadden, Susan (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S255000, C704S256000
active
06526380
ABSTRACT:
The invention relates to a huge vocabulary speech recognition system for recognizing a sequence of spoken words, the system comprising input means for receiving a time-sequential input pattern representative of the sequence of spoken words; and a large vocabulary speech recognizer operative to recognize the input pattern as a sequence of words from the vocabulary using a large vocabulary recognition model associated with the speech recognizer.
From U.S. Pat. No. 5,819,220 a system is known for recognizing speech in an Internet environment. The system is particularly targeted towards accessing information resources on the World Wide Web (WWW) using speech. Building a speech recognition system as an interface to the Web faces very different problems from those encountered in traditional speech recognition domains. The primary problem is the huge vocabulary which the system needs to support, since a user can access virtually any document on any topic. It is very difficult, if not impossible, to build an appropriate recognition model, such as a language model, for those huge vocabularies. In the known system a predetermined recognition model, including a statistical n-gram language model and an acoustic model, is used. The recognition model is dynamically altered using a web-triggered word set. An HTML (HyperText Mark-up Language) document contains links, such as hypertext links, which are used to identify a word set to be included in the final word set for probability boosting the word recognition search. In this way the word set used for computing the speech recognition scores is biased by incorporating the web-triggered word set.
The known system requires a suitable huge vocabulary model as a starting model to be able to obtain a biased model after adaptation. In fact, the biased model can be seen as a conventional large vocabulary model optimized for the current recognition context. As indicated before, it is very difficult to build a suitable huge vocabulary model, even if it is only used as a starting model. A further problem occurs for certain recognition tasks, such as recognizing input for particular Web sites or HTML documents, like those present on search engines or large electronic shops, such as book stores. In such situations the number of words which can be uttered is huge. A conventional large vocabulary model will in general not be able to effectively cover the entire range of possible words. Biasing a starting model with relatively few words will not result in a good recognition model. Proper biasing would require a huge additional word set and a significant amount of processing, assuming the starting model was already reasonably good.
It is an object of the invention to provide a recognition system which is better capable of dealing with huge vocabularies.
To achieve the object, the system is characterized in that the system comprises a plurality of N large vocabulary speech recognizers, each being associated with a respective, different large vocabulary recognition model; each of the recognition models being targeted to a specific part of the huge vocabulary; and the system comprises a controller operative to direct the input pattern to a plurality of the speech recognizers and to select a recognized word sequence from the word sequences recognized by the plurality of speech recognizers.
By using several recognizers, each with a specific recognition model targeted at a part of the huge vocabulary, the task of building a recognition model for a huge vocabulary is broken down into the manageable task of building large vocabulary models for specific contexts. Such contexts may include health, entertainment, computer, arts, business, education, government, science, news, travel, etc. It will be appreciated that each of those contexts will normally overlap in vocabulary, for instance in the general words of the language. The contexts will differ in the statistics of those common words as well as in the jargon specific to those contexts. By using several of those models to recognize the input, a wider range of utterances can be recognized using properly trained models. A further advantage of using several models is that this allows a better discrimination during the recognition. If one huge vocabulary were used, certain utterances would only be recognized in one specific meaning (and spelling). As an example, if a user pronounces a word sounding like ‘color’, most of the recognized word sequences will include the very common word ‘color’. It will be less likely that the word ‘collar’ (of a fashion context) is recognized, or ‘collar’ of collared herring (food context), or collar-bone (health context). Those specific words do not have much chance of being recognized in a huge vocabulary, which inevitably will be dominated by frequently occurring word sequences of general words. By using several models, each model will identify one or more candidate word sequences from which a selection can then be made. Even if in this final selection a word sequence with ‘color’ gets selected, the alternative word sequences with ‘collar’ in them can be presented to the user.
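The controller behaviour described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class names, the lookup-table "recognizers", the phonetic key "K AH L ER" and the scores are all hypothetical, and the sketch assumes that scores from different models are directly comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    context: str   # which context-specific model produced the candidate
    words: str     # recognized word sequence
    score: float   # hypothetical score; higher is better


class ToyRecognizer:
    """Stand-in for a large vocabulary recognizer tied to one context model."""

    def __init__(self, context, table):
        self.context = context
        self.table = table  # maps an input pattern to (words, score) pairs

    def recognize(self, pattern):
        return [Hypothesis(self.context, words, score)
                for words, score in self.table.get(pattern, [])]


def control(pattern, recognizers):
    """Direct the pattern to every recognizer, then select among the results.

    Returns the best-scoring hypothesis plus the remaining candidates, so
    that alternatives such as 'collar' can still be offered to the user.
    """
    candidates = []
    for recognizer in recognizers:
        candidates.extend(recognizer.recognize(pattern))
    ranked = sorted(candidates, key=lambda h: h.score, reverse=True)
    if not ranked:
        return None, []
    return ranked[0], ranked[1:]


general = ToyRecognizer("general", {"K AH L ER": [("color", 0.9)]})
fashion = ToyRecognizer("fashion", {"K AH L ER": [("collar", 0.6)]})
health = ToyRecognizer("health", {"K AH L ER": [("collar-bone", 0.4)]})

best, alternatives = control("K AH L ER", [general, fashion, health])
```

In this toy run the general model wins with ‘color’, while the fashion and health candidates remain available as alternatives.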
Preferably, the recognizers operate in parallel in the sense that the user does not experience a significant delay in the recognition. This may be achieved using separate recognition engines, each having its own processing resources. Alternatively, this may be achieved using a sufficiently powerful serial processor which operates the recognition tasks in ‘parallel’ using conventional time slicing techniques.
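As a rough illustration of running the recognition tasks concurrently, the sketch below submits one input pattern to several recognizer functions through Python's standard `concurrent.futures.ThreadPoolExecutor`. The three recognizer functions and their return values are hypothetical placeholders. A thread pool is closer to the time-slicing variant mentioned above; separate processes or machines would correspond to engines with their own processing resources.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical recognizers: each returns (context, word sequence, score).
def recognize_general(pattern):
    return ("general", "color chart", 0.9)


def recognize_fashion(pattern):
    return ("fashion", "collar chart", 0.5)


def recognize_health(pattern):
    return ("health", "collar-bone chart", 0.3)


def recognize_in_parallel(pattern, recognizers):
    """Run every recognizer on the same pattern and collect all results.

    Results are returned in the order the recognizers were given,
    regardless of which task finishes first.
    """
    if not recognizers:
        return []
    with ThreadPoolExecutor(max_workers=len(recognizers)) as pool:
        futures = [pool.submit(recognizer, pattern) for recognizer in recognizers]
        return [future.result() for future in futures]


results = recognize_in_parallel(
    "input-pattern",
    [recognize_general, recognize_fashion, recognize_health],
)
```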
It should be noted that using parallel speech recognition engines is known. U.S. Pat. No. 5,754,978 describes using recognition engines in parallel. All of the engines have a relatively high accuracy of, e.g., 95%. If the 5% inaccuracy of the engines does not overlap, the accuracy of recognition can be improved. To ensure that the inaccuracies do not fully overlap, the engines may be different. Alternatively, the engines may be identical, in which case the input signal to one of the engines is slightly perturbed or one of the engines is slightly perturbed. A comparator compares the recognized text and accepts or rejects the text based on the degree of agreement between the output of the engines. Since this system requires accurate recognition engines, which do not exist for huge vocabularies, this system provides no solution for huge vocabulary recognition. Neither does the system use different models targeted towards specific parts of a huge vocabulary.
WO 98/10413 describes a dialogue system with an optional number of speech recognition modules which can operate in parallel. The modules are targeted towards a specific type of speech recognition, such as isolated digit recognition, continuous number recognition, small vocabulary word recognition, isolated large vocabulary recognition, continuous word recognition, keyword recognition, word sequence recognition, alphabet recognition, etc. The dialogue system knows up front which type of input the user will supply and accordingly activates one or more of the specific modules. For instance, if the user needs to speak a number, the dialogue engine will enable the isolated digit recognition and the continuous number recognition, allowing the user to speak the number as digits or as a continuous number. The system provides no solution for dealing with huge vocabularies.
The recognition models of the system according to the invention may be predetermined. Preferably, as defined in dependent claim 2, a model selector is used to dynamically select at least one of the models actively used for recognition. The selection depends on the context of the user input, like the query or dictation subject. Preferably, the model selector selects many of the recognition models. In practice, at least one of the models will represent the normal day-to-day vocabulary on general subjects. Such a model will normally always be used.
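One way to picture such a model selector is sketched below. The text does not specify a selection criterion, so the keyword-overlap ranking, the function name `select_models`, the keyword sets and the `max_models` limit are all assumptions made purely for illustration. The general model is always included, in line with the remark that it will normally always be used.

```python
def select_models(context_words, model_keywords, always_on="general", max_models=3):
    """Pick the models to activate for the current recognition context.

    context_words  : words describing the context (e.g. a query or subject)
    model_keywords : hypothetical mapping of model name -> set of keywords
    The always-on general model comes first, followed by the context models
    with the largest keyword overlap (ties broken alphabetically).
    """
    words = {w.lower() for w in context_words}
    overlap = {
        name: len(words & keywords)
        for name, keywords in model_keywords.items()
        if name != always_on
    }
    ranked = sorted(
        (name for name, count in overlap.items() if count > 0),
        key=lambda name: (-overlap[name], name),
    )
    return [always_on] + ranked[: max_models - 1]


model_keywords = {
    "general": set(),
    "health": {"doctor", "bone", "fracture"},
    "fashion": {"shirt", "collar", "dress"},
    "travel": {"flight", "hotel"},
}

active = select_models(["Broken", "collar", "bone", "fracture"], model_keywords)
```

Here the health model ranks ahead of the fashion model because two of its keywords appear in the context, and the travel model is not activated at all.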
In an embodiment as defined in dependent claim 3, the document defines the recognition context. As defined in the dependent claim 5, this may be done by scanning the words
Besling Stefan
Thelen Eric
Ullrich Meinhard
Koninklijke Philips Electronics, N.V.
McFadden Susan
Piotrowski Daniel J.