Reexamination Certificate
1999-04-16
2002-04-30
Knepper, David D. (Department: 2645)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S243000
active
06381571
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to speech recognition and more particularly to the determination of utterance recognition parameters.
BACKGROUND OF THE INVENTION
Referring to FIG. 1, there is illustrated a block diagram of a speech recognition system comprising a source 13 of Hidden Markov Models (HMM) and input speech applied to a recognizer 11. The result is recognized speech such as text. One of the sources of degradation for speech recognition of the input speech is the distortion due to transducer difference, channel, and speaker variability. Because this distortion is assumed to be additive in the log domain, utterance-based mean normalization in the log domain (or in any linear transformation of log domain, for example, cepstral domain) has been proposed to improve recognizers' robustness. See, for example, S. Furui, “Cepstral Analysis Technique for Automatic Speaker Verification,” IEEE Trans. Acoust., Speech and Signal Processing, ASSP-29(2):264-272, 1981. Due to its computational simplicity and substantial improvement in results, such mean normalization has become a standard processing technique for most recognizers.
To do such normalization, the utterance log-spectral mean must be computed over all N frames:
$$\bar{c}_N \triangleq \frac{1}{N} \sum_{i=1}^{N} c_i \qquad (1)$$
where $c_n$ is the $n$-th log-spectral vector. The log-spectral vectors are produced by sampling the incoming speech, taking a block or window of samples, performing a discrete Fourier transform on these samples, and taking the logarithm of the transform output.
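The steps above, together with Equation 1, can be sketched in a few lines of Python. This is only an illustrative sketch: the text does not say how the logarithm is applied to the complex transform output, so the log of the magnitude is assumed here, and the function names and the `floor` guard against `log(0)` are hypothetical.

```python
import cmath
import math


def log_spectrum(window, floor=1e-10):
    """Log-magnitude spectrum of one window of samples (naive DFT).

    Assumption: the logarithm is taken of the DFT magnitude, clamped
    below by `floor` so that log(0) never occurs.
    """
    n = len(window)
    spectrum = []
    for k in range(n // 2 + 1):  # non-redundant bins for real-valued input
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(window))
        spectrum.append(math.log(max(abs(acc), floor)))
    return spectrum


def utterance_mean(vectors):
    """Equation 1: average the log-spectral vectors over all N frames."""
    n_frames = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n_frames for d in range(dim)]


def mean_normalize(vectors):
    """Subtract the utterance mean from every frame.

    Needs the complete utterance before any output can be produced.
    """
    mean = utterance_mean(vectors)
    return [[v[d] - mean[d] for d in range(len(mean))] for v in vectors]
```

Because the mean is subtracted, a constant offset added to every frame in the log domain (the assumed form of the distortion) leaves the normalized vectors unchanged.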
This technique is not suitable for on-line, real-time operation: because the utterance mean is required, the normalized vectors cannot be produced until the whole utterance has been observed. In Equation 1, $\bar{c}_N$ is the log-spectral vector averaged over N windows. Because N is the total number of frames in the utterance, applicability to real-time systems is limited.
To solve this problem, sequential estimation of the mean vector with exponential smoothing techniques has been disclosed. See M. G. Rahim and B. H. Juang, “Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition,” IEEE Trans. on Speech and Audio Processing, 4(1):19-30, Jan. 1996. In sequential determination, the estimate is updated as each new vector arrives, so that it improves as more vectors are observed:
$$\bar{c}_n = \alpha \cdot \underbrace{\bar{c}_{n-1}}_{\text{past estimate}} + (1-\alpha) \cdot \underbrace{c_n}_{\text{current input vector}} \qquad (2)$$
and the mean-subtracted vector:
$$\hat{c}_n = c_n - \bar{c}_n \qquad (3)$$
where $\bar{c}_n$ is an estimate of the mean up to frame n and $\alpha$ is a weighting value between zero and one.
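The sequential update and the mean subtraction can be sketched as a Python generator that emits one normalized vector per incoming frame. This is a minimal sketch, not the cited authors' implementation: the function name is hypothetical, and allowing the weight to be either a fixed number or a function of the frame index n is an illustrative design choice.

```python
def sequential_mean_subtract(vectors, initial_mean, alpha):
    """Yield (mean estimate, mean-subtracted vector) for each frame.

    `alpha` is either a float between 0 and 1 or a callable n -> weight
    (hypothetical convenience so one loop covers fixed and
    frame-dependent weights).
    """
    mean = list(initial_mean)
    for n, c_n in enumerate(vectors, start=1):
        a = alpha(n) if callable(alpha) else alpha
        # Equation 2: blend the past estimate with the current input vector.
        mean = [a * m + (1.0 - a) * c for m, c in zip(mean, c_n)]
        # Equation 3: subtract the running estimate from the current vector.
        yield mean, [c - m for c, m in zip(c_n, mean)]
```

Each output depends only on frames already seen, so normalized vectors are available without waiting for the end of the utterance.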
Among the choices for the initial mean $\bar{c}_0$ and weighting factor $\alpha$, the prior art discusses two cases.
The first is the cumulative mean removal case where
$$\bar{c}_0 = 0 \quad \text{and} \quad \alpha = \frac{n-1}{n} \qquad (4)$$
Equation 2 reduces to
$$\bar{c}_n = \bar{m}_n \triangleq \frac{1}{n} \sum_{i=1}^{n} c_i \qquad (5)$$
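The reduction can be checked by induction on n. The following worked step is a sketch that uses only the recursion of Equation 2 and the frame-dependent weight $\alpha = (n-1)/n$:

```latex
% Base case (n = 1): the weight is alpha = 0, so
\bar{c}_1 = 0 \cdot \bar{c}_0 + 1 \cdot c_1 = c_1 .
% Inductive step: assume \bar{c}_{n-1} = \frac{1}{n-1} \sum_{i=1}^{n-1} c_i . Then
\bar{c}_n = \frac{n-1}{n} \, \bar{c}_{n-1} + \frac{1}{n} \, c_n
          = \frac{1}{n} \sum_{i=1}^{n-1} c_i + \frac{1}{n} \, c_n
          = \frac{1}{n} \sum_{i=1}^{n} c_i .
```

The base case also shows that the initial value is multiplied by zero at the first frame, so it has no influence on any later estimate.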
In this case, at time n the mean vector is approximated by the mean of all vectors observed up to time n. For large n, Equation 5 gives a mean that is very close to the true utterance mean, i.e., it converges to the utterance mean in Equation 1. On the other hand, when $\bar{c}_0 = 0$, no prior knowledge of the mean is used, which makes the mean unreliable for short utterances.
The second case, called exponential smoothing, sets
$$\bar{c}_0 = \text{mean vector over training data}, \qquad 0 < \alpha < 1 \qquad (6)$$
Rearranging Equation 2, we get
$$\bar{c}_n = \alpha^{n} \cdot \bar{c}_0 + (1-\alpha) \sum_{i=1}^{n} \alpha^{n-i} \cdot c_i \qquad (7)$$
The second term of Equation 7 is a weighted sum of all vectors observed up to time n. Due to the exponential decay of the weights $\alpha^{n-i}$, only the most recently observed vectors are dominant contributors to the sum, while the more distant past vectors contribute very little. Consequently, for large n the mean given by Equation 7 will not usually be close to the true utterance mean, i.e., asymptotically, exponential smoothing does not give the utterance mean.
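The decay of the weights can be made concrete with a short Python sketch that unrolls the recursion of Equation 2 for a fixed weight and compares the result with the frame-by-frame update. The function names are hypothetical, and a single vector component (a scalar) is used for brevity.

```python
def smoothing_weights(n, alpha):
    """Weights after n frames: alpha**n on the initial mean, and
    (1 - alpha) * alpha**(n - i) on the vector from frame i = 1..n."""
    prior_weight = alpha ** n
    frame_weights = [(1.0 - alpha) * alpha ** (n - i)
                     for i in range(1, n + 1)]
    return prior_weight, frame_weights


def closed_form_mean(values, initial_mean, alpha):
    """Unrolled (closed-form) estimate for one vector component."""
    prior_weight, frame_weights = smoothing_weights(len(values), alpha)
    return prior_weight * initial_mean + sum(
        w * c for w, c in zip(frame_weights, values))


def recursive_mean(values, initial_mean, alpha):
    """Equation 2 applied frame by frame to one vector component."""
    mean = initial_mean
    for c in values:
        mean = alpha * mean + (1.0 - alpha) * c
    return mean
```

The weights always sum to one, but they are far from uniform: with an assumed weight of 0.9, a frame 50 steps in the past carries roughly 0.5% of the weight given to the current frame, which is why the estimate tracks recent frames rather than the whole utterance.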
SUMMARY OF THE INVENTION
In accordance with one embodiment of the present invention, an estimate of the utterance mean is determined by maximum a posteriori probability (MAP) estimation. This MAP estimate is subtracted from the log-spectral vector of the incoming signal, and the result is applied to a speech recognizer in a speech recognition system.
REFERENCES:
patent: 5727124 (1998-03-01), Lee et al.
patent: 6151573 (2000-11-01), Gong
Mazin G. Rahim and Biing-Hwang Juang, “Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition,” IEEE Transactions on Speech and Audio Processing, vol. 4, no. 1, pp. 19-30, Jan. 1996.
Gong Yifan
Ramalingam Coimbatore S.
Knepper David D.
Telecky , Jr. Frederick J.
Texas Instruments Incorporated
Troike Robert L.