Telephonic communications – Audio message storage – retrieval – or synthesis – Digital signal processing
Reexamination Certificate
1997-10-03
2002-12-31
Hoosain, Allan (Department: 2645)
C379S088010, C379S067100, C379S088080, C379S088130, C379S218010, C379S093120, C379S088210
active
06501833
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates to speech recognition systems and, more particularly, to large-vocabulary speech recognition systems. The system described herein is suitable for use in systems providing interactive natural language discourse.
2. The Prior Art
Speech recognition systems convert spoken language to a form that is tractable by a computer. The resultant data string may be used to control a physical system, may be output by the computer in textual form, or may be used in other ways.
An increasingly popular use of speech recognition systems is to automate transactions requiring interactive exchanges. An example of a system with limited interaction is a telephone directory response system in which the user supplies information of a restricted nature such as the name and address of a telephone subscriber and receives in return the telephone number of that subscriber. An example of a substantially more complex such system is a catalogue sales system in which the user supplies information specific to himself or herself (e.g., name, address, telephone number, special identification number, credit card number, etc.) as well as further information (e.g., nature of item desired, size, color, etc.) and the system in return provides information to the user concerning the desired transaction (e.g., price, availability, shipping date, etc.).
Recognition of natural, unconstrained speech is very difficult. The difficulty is increased when there is environmental background noise or a noisy channel (e.g., a telephone line). Computer speech recognition systems typically require the task to be simplified in various ways. For example, they may require the speech to be noise-free (e.g., by using a good microphone), they may require the speaker to pause between words, or they may limit the vocabulary to a small number of words. Even in large-vocabulary systems, the vocabulary is typically defined in advance. The ability to add words to the vocabulary dynamically (i.e., during a discourse) is typically limited, or even nonexistent, due to the significant computing capabilities required to accomplish the task on a real-time basis. The difficulty of real-time speech recognition is dramatically compounded in very large-vocabulary applications (e.g., tens of thousands of words or more).
One example of an interactive speech recognition system under current development is the SUMMIT speech recognition system being developed at M.I.T. This system is described in Zue, V., Seneff, S., Polifroni, J., Phillips, M., Pao, C., Goddeau, D., Glass, J., and Brill, E., “The MIT ATIS System: December 1993 Progress Report,” Proc. ARPA Human Language Technology Workshop, Princeton, N.J., March 1994, among other papers. Unlike most other systems, which are frame-based (the frame typically being a 10 ms portion of speech), the SUMMIT speech recognition system is segment-based, the segment typically being a speech sound or phone.
In the SUMMIT system, the acoustic signal representing a speaker's utterances is first converted into an electrical signal for signal processing. The processing may include filtering to enhance subsequent recognizability of the signal, remove unwanted noise, etc. The signal is converted to a spectral representation, then divided into segments delimited by hypothesized boundaries of individual speech sounds. The network of hypothesized segments is then passed to a phonetic classifier whose purpose is to associate each segment with a known “phone” or speech sound identity. Because of uncertainties in the recognition process, each segment is typically associated with a list of several phones, with a probability associated with each phone. Both the segmentation and the classification are performed in accordance with acoustic models for the possible speech sounds.
The end product of the phonetic classifier is a “lattice” of phones, each phone having a probability associated therewith. The actual words spoken at the input to the recognizer should form a path through this lattice. Because of the uncertainties of the process, there are usually on the order of millions of possible paths to be considered, each with a different overall probability. A major task of the speech recognizer is to associate the segments along paths in the phone lattice with words in the recognizer vocabulary to thereby find the best path.
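The search for a best path through such a lattice can be illustrated with a minimal sketch. It is not the SUMMIT implementation: the lattice contents, the arc layout, and the function name `best_path` are assumptions made for illustration, and the sketch scores phone sequences only, without the lexical constraint that a real recognizer applies. It assumes that segment boundaries are numbered in time order and that every arc runs forward in time.

```python
import math

# Hypothetical lattice: each arc is (start_boundary, end_boundary, phone, probability).
LATTICE = [
    (0, 1, "k", 0.6), (0, 1, "g", 0.4),
    (1, 2, "ae", 0.7), (1, 2, "eh", 0.3),
    (2, 3, "t", 0.8), (2, 3, "d", 0.2),
    (1, 3, "ao", 0.1),  # a competing segmentation that skips boundary 2
]

def best_path(lattice, start, end):
    """Return (probability, phones) of the most probable path from start to end."""
    # best[b] holds (log probability, phone sequence) of the best path reaching boundary b.
    best = {start: (0.0, [])}
    # Arcs run forward in time, so visiting them in order of start boundary
    # guarantees that best[s] is final before any arc leaving s is used.
    for s, e, phone, p in sorted(lattice, key=lambda arc: arc[0]):
        if s not in best:
            continue  # boundary s is unreachable from the start
        score = best[s][0] + math.log(p)
        if e not in best or score > best[e][0]:
            best[e] = (score, best[s][1] + [phone])
    log_p, phones = best[end]
    return math.exp(log_p), phones

probability, phones = best_path(LATTICE, 0, 3)
print(phones, round(probability, 3))
```

Working in log probabilities keeps the scores numerically stable when paths are long. The dynamic program examines each arc once, which is what makes a lattice with millions of distinct paths searchable at all.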
In prior art systems, such as the SUMMIT system, the vocabulary or lexical representation is a “network” that encodes all possible words that the recognizer can identify, all possible pronunciations of these words, and all possible connections between these words. This vocabulary is usually defined in advance, that is, prior to attempting to recognize a given utterance, and is usually fixed during the recognition process. Thus, if a word not already in the system's vocabulary is spoken during a recognition session, the word will not successfully be recognized.
The structure of current lexical representation networks does not readily lend itself to rapid updating when large vocabularies are involved, even when done on an “off-line” basis, that is, in the absence of speech input. In particular, in prior art lexical representations of the type exemplified by the SUMMIT recognition system, the lexical network is formed as a number of separate pronunciation networks, one for each word in the vocabulary, together with links establishing the possible connections between words. The links are placed based on phonetic rules. In order to add a word to the network, all words presently in the vocabulary must be checked in order to establish phonetic compatibility between the respective nodes before the links are established. This is a computationally intensive problem whose difficulty increases as the size of the vocabulary increases. Thus, the word addition problem is a significant issue in phonetically-based speech recognition systems.
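The cost of word addition in a network of this kind can be sketched as follows. The class `NaiveLexicalNetwork` and the rule in `compatible` are hypothetical stand-ins: the source does not state the actual phonetic rules, and each word is given a single pronunciation for brevity. The point of the sketch is only that the number of compatibility checks per added word grows with the size of the vocabulary.

```python
def compatible(end_phone, begin_phone):
    # Hypothetical phonetic rule, used only for illustration: a link is
    # allowed unless the two words meet on the same phone.
    return end_phone != begin_phone

class NaiveLexicalNetwork:
    def __init__(self):
        self.words = {}     # word -> list of phones (one pronunciation per word)
        self.links = set()  # (from_word, to_word) pairs allowed by the rule

    def add_word(self, word, phones):
        """Add a word and return how many compatibility checks it cost."""
        checks = 0
        self.words[word] = phones
        for other in self.words:
            # The new word is checked against every word (itself included),
            # in both directions, before any link is placed.
            for a, b in ((word, other), (other, word)):
                checks += 1
                if compatible(self.words[a][-1], self.words[b][0]):
                    self.links.add((a, b))
        return checks

net = NaiveLexicalNetwork()
costs = [
    net.add_word("cat", ["k", "ae", "t"]),
    net.add_word("dog", ["d", "ao", "g"]),
    net.add_word("tea", ["t", "iy"]),
]
print(costs)
```

In this sketch the cost of each addition is proportional to the current vocabulary size, so building or extending a vocabulary of tens of thousands of words requires a correspondingly large number of checks per word.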
In present speech recognition systems, a precomputed language model is employed during the search through the lexical network to favor sequences of words which are likely to occur in spoken language. The language model can provide the constraint to make a large vocabulary task tractable. This language model is generally precomputed based on the predefined vocabulary, and thus is generally inappropriate for use after adding words to the vocabulary.
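Why a precomputed language model fits poorly with added words can be seen in a small sketch. A bigram model is assumed here purely for illustration, since the source does not name the type of language model; the training sentences and the function names are likewise hypothetical. A word that was absent when the model was computed receives no probability at all.

```python
from collections import Counter

def train_bigram(sentences):
    """Precompute bigram counts and history counts from a fixed corpus."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return histories, bigrams

def bigram_prob(model, previous, word):
    """Maximum-likelihood estimate of P(word | previous), without smoothing."""
    histories, bigrams = model
    if histories[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / histories[previous]

model = train_bigram(["show flights to boston", "show flights to denver"])
print(bigram_prob(model, "to", "boston"))  # seen in the training corpus
print(bigram_prob(model, "to", "austin"))  # word added after the model was computed
```

Without some mechanism for assigning probability to new words, the search would never favor a path through "austin", however well the acoustics matched.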
SUMMARY OF THE INVENTION
A. Objects of the Invention
Accordingly, it is an object of the invention to provide an improved speech recognition system.
A further object of the invention is to provide a speech recognition system which facilitates the rapid addition of words to the vocabulary of the system.
Still a further object of the invention is to provide an improved speech recognition system which facilitates vocabulary addition during the speech recognition process without appreciably slowing the speech recognition process or disallowing use of a language model.
Yet another object of the invention is to provide a speech recognition system which is particularly suited to active vocabularies on the order of thousands of words and greater and total vocabularies of millions of words and greater.
Still a further object of the invention is to provide a speech recognition system which can use constraints from large databases without appreciably slowing the speech recognition process.
BRIEF DESCRIPTION OF THE INVENTION
In accordance with the present invention, the lexical network containing the vocabulary that the system is capable of recognizing includes a number of constructs (defined herein as “word class” nodes, “phonetic constraint” nodes, and “connection” nodes) in addition to the word begin and end nodes commonly found in speech recognition systems. (A node is a connection point within the lexical network. Nodes may be joined by arcs to form paths through the network. Some of the arcs between nodes specify speech segments, i.e., phones.) These constructs effectively precompile and organize both phonetic and syntactic/sema
Nguyen John N.
Phillips Michael S.
Hale and Dorr LLP
Hoosain Allan
SpeechWorks International, Inc.