Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition
Reexamination Certificate
1999-05-28
2001-02-27
Hudspeth, David R. (Department: 2741)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S240000, C704S236000, C704S231000, C704S247000
Reexamination Certificate
active
06195639
ABSTRACT:
BACKGROUND
The present invention relates generally to speech recognition systems and, more particularly, to a system and method having reduced memory access requirements associated with the recognition of an isolated word.
Isolated speech recognition is a process in which an unknown spoken utterance (or word) is identified. Through a process known as training, signals representing known words are examined and features of the words are determined and recorded for use as recognition models (or patterns) in a speech recognizer memory. The recognition models represent typical acoustic renditions of known words. In the training process, a training algorithm is applied to the recognition models to form these stored representations that are used to recognize future unknown words.
Speech recognition is generally implemented in three stages, as illustrated in FIG.
1
. In step
100
, an unknown speech signal is received via, for example, a microphone and processed to produce digital data samples. In step
110
, features that are based on a short-term spectral analysis of the unknown speech signal are determined at predetermined time intervals. As will be appreciated, these features, commonly called “feature vectors,” are usually the output of some type of spectral analysis technique, such as a filter bank analysis, a linear predictive coding analysis, or a Fourier transform analysis. In step
120
, the feature vectors are compared to one or more of the recognition models that have been stored during the above-described training process. During this comparison, the degree of similarity between the feature vectors and recognition models is computed. Finally, in step
130
, the speech recognizer determines, based on the recognition model similarity scores, the recognition model that best matches the unknown speech signal. The speech recognizer then outputs the word corresponding to the recognition model having the highest similarity score.
Most of today's speech recognizers are based on the Hidden Markov Model (HMM). As will be appreciated, the HMM provides a pattern matching approach to speech recognition as described in detail in “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” by Lawrence R. Rabiner, Proceedings of the IEEE, Vol. 77, No. 2, February 1989, pp. 257-286, the entirety of which is incorporated by reference herein.
An HMM is generally defined by the following elements:
1. The number of states in the model, N;
2. The state-transition matrix
A
=
[
a
11
a
12
⋯
a
1
⁢
N
a
21
a
22
⋮
⋮
⋰
a
N1
⋯
a
NN
]
,
where a
ij
is the probability of the process moving from state q
i
to state q
j
at time t=1, 2, etc. and given that the process is at state q
i
at time t−1;
3. The observation probability distribution,
b
i
({right arrow over (o)}), i=1, . . . , N for all states q
i
, i=1, . . . , N;
and
4. The initial state probability &pgr;
i
for i=1, 2, . . . , N.
The Viterbi algorithm is commonly used in HMM speech recognizers to perform the comparison and decision operations described in FIG.
1
. The Viterbi algorithm may quite simply be stated as: given the observations {right arrow over (
o
t
+L )}, t=1, 2, . . . , T, where T is the duration of the detected speech measured in number of feature vectors, find the most probable state sequence for each model in the vocabulary and choose the model with the highest probability. The following represents a conventional pattern matching algorithm for performing this task.
For every speech model &lgr;
m
, where m=1, . . . , M, the following processes are performed:
Pre-processing:
π
~
i
=
log
⁢
⁢
(
π
i
)
,
1
≤
i
≤
N
b
~
i
⁡
(
o
→
t
)
=
log
⁡
(
b
i
⁡
(
o
→
t
)
)
,
1
≤
i
≤
N
,
1
≤
t
≤
T
a
~
ij
=
log
⁡
(
a
ij
)
1
≤
i
,
j
≤
N
Initialization:
δ
~
1
=
π
~
1
+
b
~
i
⁡
(
o
→
I
)
,
⁢
1
≤
i
≤
N
Recursion:
δ
~
t
⁡
(
j
)
=
max
1
≤
i
≤
N
⁢
[
δ
~
t
-
1
⁡
(
i
)
+
a
~
ij
]
+
b
~
j
⁡
(
o
→
t
)
,
2
≤
t
≤
T
,
1
≤
j
≤
N
Termination:
P
~
m
*
=
max
1
≤
i
≤
N
⁢
[
δ
~
T
⁡
(
i
)
]
where the score {tilde over (&dgr;)}
t
(j) is an approximation of the logarithm of the probability for the most probably path passing node j at the time t and {tilde over (P)}
m
* is the logarithm of the probability for the most probably path ending at node N at time T. The recognition result (i.e., the word to which the unknown speech signal corresponds) is {circumflex over (&lgr;)}=&lgr;
m
, where
m
=
arg
m
⁢
max
1
≤
m
≤
M
⁢
P
~
m
*
.
The above-described conventional pattern matching algorithm has four stages, namely, a pre-processing stage, an initialization stage, a recursion stage and a termination stage. In the pre-processing stage, logarithmic values of the initial state probability &pgr;
i
for i=1, . . . , N, the description of the feature probability distribution b
i
(
{right arrow over (O)}
t
) where 1≦i≦N and 1≦t≦T, and the state-transition probabilities a
ij
, where i≧1 and j≦N, are computed and stored in memory. The function b
j
(o) and the values of the initial state probability &pgr;
i
and the state-transition probabilities a
ij
generally depend upon the particular speech model &lgr;m being considered. However, in order to decrease the amount of data describing the models, some of the constants are set to be equal regardless of the model. For example, the initial state probabilities are often set to &pgr;
1
=1, &pgr;
i
=0 when i>1 for all the speech models. These logarithmic values that are determined during the pre-processing stage are generally computed and stored during the “training” of a speech recognizer. It will be appreciated that, in those situations where the value of a particular probability is equal to zero, the following convention will be used log (0)=−∞. Since the pre-processing stage is performed once and saved, the cost of this processing stage to most systems is negligible.
In the initialization stage, the path scores {tilde over (&dgr;)}
t
(j) is calculated at time 1 for state i, where 1≦i≦N. This calculation involves fetching the logarithmic values of the state-transition probabilities au and the description of the function {tilde over (b)}
j
(o) for j=1. In the recursion stage, the score {tilde over (&dgr;)}
t
(j) is calculated for state i, ranging from 1 to N, at time t, where 2≦t≦T, and state j, where 1≦j≦N. It will be appreciated that the first score calculation is determined for t=2 and state j ranging from 1 to N. Since the value of j changes for each calculation, the description of the function {tilde over (b)}
j
(o) also changes for each calculation. As such, a memory fetch operation is performed for each calculation involving a different j value in order to retrieve the appropriate description of the function {tilde over (b)}
j
(o). The value of t is then incremented to 3 and the score is calculated for state j ranging from 1 to N. It is evident from this calculation that memory accesses in the order of O(N
2
) are performed during this stage for each model m.
Finally, during the termination stage, the highest probability result (or best path score) for each specific model is determined from the calculations obtained in the recursion stage. An overall best path score is obtained by comparing the best path scores obtained for each model m. Additional information regarding the above-described conventional algorithm can be found in “Fundamentals of Speech Recognition,” by Lawrence R. Rabiner et al., Prentice Hall, 1993, pp. 321-389, which is incorporated by reference herein.
Looking closely at the recursion operation described above, it is apparent that memory accesses in the order O(MN
2
) are needed for each time interval t from 2 to T. Often, the so-called Bakis model (or l
Feltström Alberto Jimenez
Rasmusson Jim
Burns Doane Swecker & Mathis L.L.P.
Hudspeth David R.
Nolan Daniel A.
Telefonaktiebolaget LM Ericsson (publ)
LandOfFree
Matching algorithm for isolated speech recognition does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Matching algorithm for isolated speech recognition, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Matching algorithm for isolated speech recognition will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2563174