Multiple stage speech recognizer
Data processing: speech signal processing, linguistics, language (Speech signal processing; Recognition)
Type: Reexamination Certificate (active)
Filed: 1998-03-03
Issued: 2004-06-29
Examiner: Dorvil, Richemond (Department: 2654)
U.S. Class: C704S252000
Patent number: 6,757,652
ABSTRACT:
BACKGROUND
The invention relates to an automatic speech recognizer which uses multiple processing stages to determine the words contained in a spoken utterance.
Real-time speech recognition can be implemented on a variety of types of computers. An implementation of a speech recognizer, in general, uses a digital signal processor, a general purpose processor, or both. Typical digital signal processors (DSPs, such as the Texas Instruments TMS320C31) are suited for computationally intensive tasks, such as signal processing, and for low latency processing. However, the memory available to a DSP is generally limited, in part due to the cost of memory devices that allow the DSP to execute at its full speed (i.e., without memory wait states). General purpose processors (such as the Intel Pentium) can, in general, support more memory, which is generally less costly than DSP memory, but such processors are not tailored to signal processing tasks.
A speech recognition algorithm implemented on a DSP-based computer, in general, has a vocabulary size and linguistic complexity that is limited by the memory resources associated with the DSP. More complex speech recognition algorithms, for example those supporting larger vocabularies, have been implemented using computers based on general purpose processors, as have “N-best” algorithms that produce multiple alternative hypotheses, rather than a single best hypothesis, of what was said.
A speech recognition algorithm that is implemented using both a DSP and a general purpose processor often relies on the DSP to perform signal processing tasks, for example computing spectral features at regular time intervals. These spectral features, such as linear predictive coefficients, cepstra, or vector quantized features, are then passed from the DSP to the general purpose processor for further stages of speech recognition.
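For illustration only, the following Python sketch shows the kind of frame-by-frame feature computation such a DSP stage might perform. The 8 kHz rate matches telephone speech as described later in this background; the frame sizes, the real-cepstrum feature, and all function names are assumptions for the sketch, not details from the text.

```python
# Illustrative sketch of a per-frame spectral front end; not code from the patent.
import numpy as np

SAMPLE_RATE = 8000          # telephone speech sampled at 8 kHz
FRAME_LEN = 200             # 25 ms analysis window (assumed)
FRAME_STEP = 80             # 10 ms hop, i.e. features at regular time intervals
NUM_CEPSTRA = 12            # cepstral coefficients kept per frame (assumed)

def spectral_features(samples: np.ndarray) -> np.ndarray:
    """Compute one cepstral feature vector per frame (the DSP-side stage)."""
    window = np.hamming(FRAME_LEN)
    frames = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_STEP):
        frame = samples[start:start + FRAME_LEN] * window
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-10   # avoid log(0)
        cepstrum = np.fft.irfft(np.log(spectrum))       # real cepstrum
        frames.append(cepstrum[:NUM_CEPSTRA])
    return np.array(frames)

if __name__ == "__main__":
    utterance = np.random.randn(SAMPLE_RATE)            # 1 s of placeholder audio
    features = spectral_features(utterance)
    # These per-frame vectors would be passed from the DSP to the general
    # purpose processor for the later recognition stages.
    print(features.shape)
```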
Speech recognition has been applied to telephone based input. PureSpeech Inc. has previously released a software product, Recite 1.2, that recognizes utterances spoken by telephone callers. A computer architecture on which this product can be executed is shown in FIG. 1. Computer 100 is used to interact by voice with callers over multiple telephone lines 110. Computer 100 automatically recognizes what the callers say, and can play prompts to interact with the callers. Computer 100 includes one or more telephone interfaces 130 coupled to a general purpose computer 120, such as a single-board computer, over a data bus 125. General purpose computer 120 includes a general purpose processor 122, working memory 124, such as dynamic RAM, and non-volatile program memory 126, such as a magnetic disk. Alternatively, program memory can reside on another computer and be accessed over a data network. Telephone interfaces 130 provide an interface to telephone lines 110 over which callers interact with the computer. Also coupled to general purpose computer 120 over data bus 125 are one or more DSP platforms 140. DSP platforms 140 are coupled to telephone interfaces 130 over a second bus, time division multiplexed (TDM) bus 150. TDM bus 150 can carry digitized speech between DSP platforms 140 and telephone interfaces 130. Each DSP platform 140 includes multiple DSP processors 142, working memory 144, a data bus interface 146 to data bus 125, and a speech interface 148 to TDM bus 150. In one version of the Recite 1.2 product, general purpose processor 122 is an Intel Pentium, data bus 125 is an ISA bus, DSP platform 140 is an Antares DSP platform (model 2000/30, 2000/50, or 6000) manufactured by Dialogic Corporation, and TDM bus 150 is an SCSA bus which carries telephone signals encoded as 8-bit speech samples at an 8 kHz sampling rate. Each Antares DSP platform includes four DSP processors 142, TMS320C31 processors manufactured by Texas Instruments. Working memory 144 includes 512 KB of static RAM per DSP and 4 MB of dynamic RAM shared by the four DSP processors 142. Telephone interfaces 130 are any of several interfaces also manufactured by Dialogic Corporation, including models D41ESC, D160SC, and D112SC. For instance, each D112SC interface supports twelve analog telephone lines 110.
PureSpeech Inc.'s Recite 1.2 product incorporates a speech recognition approach related to that described in U.S. Pat. No. 5,638,487, “AUTOMATIC SPEECH RECOGNITION” (the '487 patent), which is incorporated herein by reference. In that implementation, each DSP processor on the DSP platforms is associated with exactly one telephone channel. A DSP associated with a particular telephone channel hosts the initial stages of the recognition approach that are shown in FIG. 3 of the '487 patent. In addition, an echo canceler stage is also included on the DSP prior to the spectral analyzer in order to reduce the effect of an outbound prompt on an inbound utterance. The DSP is essentially dedicated to a single task (process) of accepting input from the TDM bus, processing it, and passing it to the general purpose computer. The output of the phonetic classifier is sent to the general purpose computer, where a sentence level matcher is implemented. The sentence level matcher can provide multiple sentence hypotheses corresponding to likely utterances spoken by a talker.
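The split described above can be sketched as two cooperating tasks. The stage names below follow the text, but the function bodies are trivial placeholders; they are not the algorithms of the '487 patent or of Recite 1.2.

```python
# Sketch of the per-channel processing split; placeholder stage bodies only.
from typing import Dict, List

def echo_cancel(inbound: List[float], prompt: List[float]) -> List[float]:
    # Placeholder: subtract the outbound prompt from the inbound signal to
    # reduce its effect on the utterance (a real canceler adapts a filter).
    return [x - p for x, p in zip(inbound, prompt)]

def spectral_analysis(audio: List[float], frame: int = 80) -> List[List[float]]:
    # Placeholder: one "feature vector" (here just frame energy) per 10 ms of
    # 8 kHz speech.
    return [[sum(s * s for s in audio[i:i + frame])]
            for i in range(0, len(audio) - frame + 1, frame)]

def phonetic_classifier(features: List[List[float]]) -> List[Dict[str, float]]:
    # Placeholder: pretend each frame is scored against two phonemes.
    return [{"AA": -f[0], "S": -2.0 * f[0]} for f in features]

def dsp_channel_task(inbound: List[float], prompt: List[float]) -> List[Dict[str, float]]:
    """Stages hosted on the DSP dedicated to one telephone channel."""
    return phonetic_classifier(spectral_analysis(echo_cancel(inbound, prompt)))

def sentence_level_matcher(phonetic_scores: List[Dict[str, float]]) -> List[str]:
    """Stage hosted on the general purpose computer; returns N-best hypotheses."""
    # Placeholder: a real matcher searches the vocabulary and grammar.
    return ["call home", "call harm"]

n_best = sentence_level_matcher(dsp_channel_task([0.1] * 800, [0.0] * 800))
```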
In many speech-based telephone applications, a caller is talking for only a relatively small fraction of a telephone call. The remainder of the time is consumed by playing prompts or other information to the caller, or by quiet intervals, for example while information is being retrieved for the caller. In the Recite 1.2 software product, one DSP is allocated for each telephone interaction, regardless of whether a caller is talking or a prompt or information is being played. This is necessary, for example, because a caller may begin speaking before a prompt has completed. Therefore, in order to support 12 concurrent telephone conversations, three Antares DSP platforms with four DSPs each are needed to host the initial stages of the recognition approach.
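As a minimal sketch of that allocation rule, using only the figures given above (one DSP per call, four DSPs per Antares platform):

```python
# Allocation arithmetic from the text: one DSP per call, four DSPs per platform.
import math

DSPS_PER_PLATFORM = 4   # each Antares DSP platform carries four DSP processors

def platforms_needed(concurrent_calls: int) -> int:
    # One DSP stays allocated to each call whether the caller is talking or a
    # prompt is playing, so the requirement scales directly with call count.
    return math.ceil(concurrent_calls / DSPS_PER_PLATFORM)

assert platforms_needed(12) == 3   # the 12-conversation example from the text
```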
Speech recognition approaches have been adapted to large vocabularies, such as lists of names in the range of 1000 to 10000 names. One aspect of recognition approaches used to achieve adequate accuracy on such large vocabularies is that a large number of subword model parameters, or a large number of subword models themselves, is typically used. In the Recite 1.2 software, the phonetic classifier is hosted on the DSP. Because static RAM is used for storage related to the subword models, and the amount of static RAM available to each DSP is limited, the number of subword models and their parameters is also limited. This memory limitation can impact accuracy on some large vocabulary tasks.
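A back-of-the-envelope sketch of that limitation follows. The 512 KB figure is from the text; the bytes-per-parameter and parameters-per-model values are purely illustrative assumptions, not values from the patent.

```python
# Rough memory budget for subword models in per-DSP static RAM (sketch only).
STATIC_RAM_BYTES = 512 * 1024      # per-DSP static RAM, from the text
BYTES_PER_PARAM = 4                # assumed 32-bit model parameters
PARAMS_PER_MODEL = 2048            # assumed parameters per subword model

# Ignoring program and scratch space, roughly this many subword models fit:
models_that_fit = STATIC_RAM_BYTES // (BYTES_PER_PARAM * PARAMS_PER_MODEL)
print(models_that_fit)             # 64 under these illustrative assumptions
```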
SUMMARY
In one aspect, in general, the invention is software stored on a computer readable medium for causing a multiprocessor computer to perform the function of recognizing an utterance spoken by a speaker. The software includes software for causing a first processor, such as a DSP processor, to perform the function of computing a series of segments associated with the utterance, each segment having a time interval within the utterance, and scores characterizing the degree of match of the utterance in that time interval with a first set of subword units, and sending the series of segments to a second processor. The software also includes software for causing the second processor, such as a general purpose processor, to perform the functions of receiving the series of segments, determining multiple word sequence hypotheses associated with the utterance, and computing scores for the word sequence hypotheses, using a second set of subword units to represent words in the word sequence hypotheses. The first set of subword units can be a set of phonemes, and the second set of subword units can be a set of context dependent phonemes.
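A hedged sketch of the data that might flow between the two processors in this aspect: the segment fields mirror the description (a time interval within the utterance plus scores against a first set of subword units), while the triphone representation and the rescoring logic below are placeholders, not the claimed method.

```python
# Sketch of the two-stage data flow: segments from the first processor,
# hypothesis rescoring with context dependent units on the second processor.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Segment:
    start_ms: int                      # time interval within the utterance
    end_ms: int
    phoneme_scores: Dict[str, float]   # degree of match with the first unit set

# The first processor (e.g. the DSP) would produce a series of such segments
# and send them to the second processor.
segments = [
    Segment(0, 120, {"k": -1.2, "g": -2.5}),
    Segment(120, 260, {"ao": -0.8, "aa": -1.1}),
    Segment(260, 400, {"l": -0.9, "r": -1.7}),
]

def score_hypothesis(pronunciation: List[Tuple[str, str, str]],
                     segs: List[Segment],
                     cd_adjust: Dict[Tuple[str, str, str], float]) -> float:
    """Placeholder second-stage scoring using context dependent phonemes."""
    # Naive one-to-one alignment of context dependent phonemes to segments;
    # each unit contributes the segment's score for its center phoneme plus an
    # assumed context dependent adjustment. A real matcher searches alignments.
    score = 0.0
    for (left, center, right), seg in zip(pronunciation, segs):
        score += seg.phoneme_scores.get(center, -10.0)   # unseen phoneme penalty
        score += cd_adjust.get((left, center, right), 0.0)
    return score

# Two word sequence hypotheses, each represented here by triphones (an assumed
# form of context dependent phoneme).
hypotheses = {
    "call": [("#", "k", "ao"), ("k", "ao", "l"), ("ao", "l", "#")],
    "gall": [("#", "g", "ao"), ("g", "ao", "l"), ("ao", "l", "#")],
}
cd_adjust = {("k", "ao", "l"): 0.3}    # assumed context dependent refinement
ranked = sorted(hypotheses,
                key=lambda w: score_hypothesis(hypotheses[w], segments, cd_adjust),
                reverse=True)
print(ranked)   # ['call', 'gall'] with these toy numbers
```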
In another aspect, in general, the invention is a method for recognizing the words in a spoken utterance. The method includes accepting data for the spoken utterance and f
Fan, Wensheng; Lund, Michael; Wright, Karl; Armstrong, Angela
Examiner: Dorvil, Richemond
Assignee: Koninklijke Philips Electronics, N.V.