Adaptive speech recognition method with noise compensation

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition



Details

Type: Reexamination Certificate
Status: active
Patent number: 06662160

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to the field of speech recognition and, more particularly, to an adaptive speech recognition method with noise compensation.
2. Description of Related Art
The robustness issue is undoubtedly crucial in the area of pattern recognition because, in real-world applications, a mismatch between training and testing data can severely degrade recognition performance. For speech recognition, the mismatch arises from the variability of inter- and intra-speaker characteristics, transducers/channels and surrounding noises. For instance, considering the application of speech recognition as a hands-free voice interface in a car environment, the non-stationary surrounding noises of engine, music, babble, wind and echo vary with driving speed and hence deteriorate the performance of the recognizer.
To solve the problem, a direct method is to collect enough training data from various noise conditions to generate speech models, such that proper speech models can be selected according to the environment of a specific application. However, such a method is impractical for the car environment because of the complexity of the noise and the tremendous amount of training data to be collected. In addition, the method requires an additional mechanism to detect changes in the environment, and such an environmental detector is difficult to design.
Alternatively, a feasible approach is to build an adaptive speech recognizer where the speech models can be adapted to new environments using environment-specific adaptation data.
In the context of statistical speech recognition, the optimal word sequence W of an input utterance X = {x_t} is determined according to the Bayes rule:

\hat{W} = \arg\max_W p(W|X) = \arg\max_W p(X|W)\,p(W),  (1)
where p(X|W) is the probability of observing X given that the word sequence is W, and p(W) is the prior probability of the word sequence W. A description of this technique can be found in RABINER, L. R.: ‘A tutorial on hidden Markov models and selected applications in speech recognition’, Proceedings of the IEEE, 1989, vol. 77, pp. 257-286, which is incorporated herein by reference. Using a Markov chain to describe the evolution of the speech feature parameters, p(X|W) can be further expressed, based on HMM (Hidden Markov Model) theory, as follows:
p(X|W) = \sum_{\text{all } S} p(X,S|W) = \sum_{\text{all } S} p(X|S,W)\,p(S|W),  (2)
where S is the state sequence of the speech signal X.
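To make the decomposition in equations (1) and (2) concrete, the following minimal Python sketch evaluates p(X|W) for a toy discrete-observation HMM by summing over every state sequence; all parameter values (pi, A, B, X) are illustrative assumptions, not values from the patent.

```python
import itertools
import numpy as np

# Toy HMM for one candidate word W (illustrative parameters only).
pi = np.array([0.6, 0.4])          # initial state probabilities
A = np.array([[0.7, 0.3],          # state transition probabilities a_ij
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],     # discrete emission probabilities b_i(o)
              [0.1, 0.3, 0.6]])

X = [0, 2, 1]                      # observed symbol sequence x_1..x_T

def likelihood_by_enumeration(pi, A, B, X):
    """p(X|W) = sum over all state sequences S of p(X|S,W) p(S|W), as in (2)."""
    n_states, T = len(pi), len(X)
    total = 0.0
    for S in itertools.product(range(n_states), repeat=T):
        p_S = pi[S[0]] * np.prod([A[S[t - 1], S[t]] for t in range(1, T)])
        p_X_given_S = np.prod([B[S[t], X[t]] for t in range(T)])
        total += p_S * p_X_given_S
    return total

print(likelihood_by_enumeration(pi, A, B, X))
```

This brute-force sum grows exponentially with the utterance length T, which motivates the Viterbi approximation introduced below.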
In general, the computations of (1) and (2) are very expensive and almost intractable because all possible state sequences S must be considered. One efficient approach is to apply the Viterbi algorithm and decode the optimal state sequence Ŝ = {ŝ_t}, as described in VITERBI, A. J.: ‘Error bounds for convolutional codes and an asymptotically optimum decoding algorithm’, IEEE Trans. Information Theory, 1967, vol. IT-13, pp. 260-269, which is incorporated herein by reference. The summation over all possible state sequences in (2) is thus approximated by the single most likely state sequence, i.e.
p(X|W) \approx p(X|\hat{S},W)\,p(\hat{S}|W) = \pi_{\hat{s}_0} \prod_{t=1}^{T} a_{\hat{s}_{t-1}\hat{s}_t}\, b_{\hat{s}_t}(x_t),  (3)
where π_{ŝ_0} is the initial state probability, a_{ŝ_{t−1}ŝ_t} is the state transition probability, and b_{ŝ_t}(x_t) is the observation probability density function of x_t in state ŝ_t, which is modeled by a mixture of multivariate Gaussian densities; that is:
b_{\hat{s}_t}(x_t) = p(x_t|\hat{s}_t = i, W) = \sum_{k=1}^{K} \omega_{ik}\, f(x_t|\theta_{ik}) = \sum_{k=1}^{K} \omega_{ik}\, N(x_t|\mu_{ik}, \Sigma_{ik}).  (4)
Herein, ω_ik is the mixture weight, and μ_ik and Σ_ik are respectively the mean vector and covariance matrix of the k-th mixture density function for the state ŝ_t = i. The occurrence probability f(x_t|θ_ik) of frame x_t associated with the density function θ_ik = (μ_ik, Σ_ik) is expressed by:
f(x_t|\theta_{ik}) = (2\pi)^{-D/2}\, |\Sigma_{ik}|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_t - \mu_{ik})'\,\Sigma_{ik}^{-1}\,(x_t - \mu_{ik})\right].  (5)
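As a concrete illustration of equations (4) and (5), the short sketch below evaluates a Gaussian mixture observation density for one feature frame; the mixture weights, mean vectors and covariance matrices are made-up example values, not trained HMM parameters.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Multivariate Gaussian f(x | mu, Sigma), as in (5)."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

def mixture_observation_prob(x, weights, means, covs):
    """b_i(x) = sum_k w_ik * N(x | mu_ik, Sigma_ik), as in (4)."""
    return sum(w * gaussian_density(x, mu, cov)
               for w, mu, cov in zip(weights, means, covs))

# Illustrative 2-dimensional, 2-component mixture for one HMM state.
weights = [0.4, 0.6]
means = [np.array([0.0, 0.0]), np.array([1.0, -1.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

x_t = np.array([0.5, -0.5])
print(mixture_observation_prob(x_t, weights, means, covs))
```

In a real recognizer this density is usually evaluated in the log domain with precomputed determinants and inverse covariances for numerical stability.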
The construction of a speech recognition system is achieved by determining the HMM parameters, such as {μ_ik, Σ_ik}, {ω_ik} and {a_ij}. The speech recognition system then operates by using the Viterbi algorithm to determine the optimal word sequence for the input speech. However, surrounding noises cause a mismatch between the speech features of the application environment and those of the training environment. As a result, the established HMMs cannot correctly describe the input speech, and the recognition rate decreases. Particularly in the car environment, the noises are so adverse that the performance of the speech recognition system is much lower than in a clean environment. Therefore, in order to implement, for example, an important application such as a human-machine voice interface in car environments, an adaptive speech recognition method with noise compensation is desired, so as to improve the recognition rate.
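The recognition step described above relies on Viterbi decoding of the single-best-path approximation in equation (3); the following minimal sketch shows that computation for the same illustrative toy HMM used in the enumeration example above. All parameter values are assumptions for illustration, not part of the patent.

```python
import numpy as np

def viterbi(pi, A, B, X):
    """Return the most likely state sequence and its joint probability,
    i.e. the single-path approximation of p(X|W) used in (3)."""
    n_states, T = len(pi), len(X)
    delta = np.zeros((T, n_states))            # best path probability ending in state j at time t
    psi = np.zeros((T, n_states), dtype=int)   # backpointers to the best previous state

    delta[0] = pi * B[:, X[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, X[t]]

    # Backtrack the optimal state sequence.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    path.reverse()
    return path, float(delta[-1].max())

# Illustrative parameters (same toy HMM as the enumeration example).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, [0, 2, 1]))
```

The returned path corresponds to Ŝ = {ŝ_t}, and the returned probability is the right-hand side of (3) evaluated along that path.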
Moreover, Mansour and Juang observed that additive white noise causes a norm shrinkage of the speech cepstral vector; a description of this can be found in MANSOUR, D. and JUANG, B.-H.: ‘A family of distortion measures based upon projection operation for robust speech recognition’, IEEE Trans. Acoustics, Speech, and Signal Processing, 1989, vol. 37, pp. 1659-1671, which is incorporated herein by reference. They consequently designed a distance measure in which a scaling factor was introduced to compensate for the cepstral shrinkage in cepstrum-based speech recognition. This approach was further extended to the adaptation of HMM parameters by detecting an equalization scalar λ between the probability density function unit θ_ik and the noisy speech frame x_t, as described in CARLSON, B. A. and CLEMENTS, M. A.: ‘A projection-based likelihood measure for speech recognition in noise’, IEEE Transactions on Speech and Audio Processing, 1994, vol. 2, no. 6, pp. 97-102, which is incorporated herein by reference. The probability measurement in (5) is modified to:
f(x_t|\lambda,\theta_{ik}) = (2\pi)^{-D/2}\, |\Sigma_{ik}|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_t - \lambda\mu_{ik})'\,\Sigma_{ik}^{-1}\,(x_t - \lambda\mu_{ik})\right].  (6)
The optimal equalization factor λ_e is determined by directly maximizing the logarithm of (6) as follows:
\lambda_e = \arg\max_{\lambda}\, \log f(x_t|\lambda,\theta_{ik}) = \frac{x_t'\,\Sigma_{ik}^{-1}\,\mu_{ik}}{\mu_{ik}'\,\Sigma_{ik}^{-1}\,\mu_{ik}}.  (7)
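The projection-based compensation of equations (6) and (7) can be illustrated with a short sketch that computes the equalization factor λ_e for one frame and one Gaussian density and then evaluates the compensated likelihood; the frame and density parameters below are invented for illustration only.

```python
import numpy as np

def optimal_equalization_factor(x, mu, sigma_inv):
    """lambda_e = (x' Sigma^{-1} mu) / (mu' Sigma^{-1} mu), as in (7)."""
    return float(x @ sigma_inv @ mu) / float(mu @ sigma_inv @ mu)

def compensated_likelihood(x, mu, sigma, lam):
    """f(x | lambda, theta) with the mean scaled by lambda, as in (6)."""
    D = len(mu)
    sigma_inv = np.linalg.inv(sigma)
    diff = x - lam * mu
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ sigma_inv @ diff)

# Illustrative cepstral-like frame and Gaussian density parameters.
mu = np.array([1.0, -0.5, 0.3])
sigma = np.diag([0.2, 0.1, 0.05])
x = 0.8 * mu + np.array([0.05, -0.02, 0.01])      # a "shrunken", slightly perturbed frame

lam_e = optimal_equalization_factor(x, mu, np.linalg.inv(sigma))
print(lam_e)                                       # close to the shrinkage factor 0.8
print(compensated_likelihood(x, mu, sigma, lam_e))
print(compensated_likelihood(x, mu, sigma, 1.0))   # uncompensated likelihood, lower
```

Because λ_e maximizes (6) over λ, the compensated likelihood is never smaller than the uncompensated one; for the shrunken frame above the gap is substantial.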
Geometrically, the factor λ_e is equivalent to the projection of x_t onto μ_ik weighted by Σ_ik^{-1}. The use of λ_e to compensate for the influence of white noise has proved helpful in increasing the speech recognition rate. However, for the problem of speech recognition in car environments, the surrounding noise is non-white and difficult to characterize. It is thus insufficient to adapt the HMM mean vector μ_ik by applying only the optimal equalization scalar λ_e. Therefo
