Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition
Reexamination Certificate
1998-12-22
2001-04-24
Dorvil, Richemond (Department: 2641)
C704S247000
active
06223159
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speaker adaptation device which selects one of a plurality of prepared standard patterns on the basis of the speech characteristics of a speaker, as well as to a speech recognition device which recognizes speech through use of the thus-selected speaker-dependent standard pattern.
2. Description of the Related Art
As shown in FIG. 7, in a speaker adaptation device described in, for example, Kosaka et al., "Structured Speaker Clustering for Speaker Adaptation" (IEICE Technical Report, SP93-110, 1993), voice feature quantity extraction means 1 subjects a speaker's voice 101, which is input separately, to an acoustic feature quantity analysis, thereby extracting feature vector time-series data Ou = [ou(1), ou(2), . . . , ou(Tu)] (where Tu represents the maximum number of speaker voice frames). Speaker-dependent standard pattern selection means 6a selects reference speaker-dependent standard patterns from reference speaker-dependent standard pattern storage means 9, subjects each thus-selected pattern to hidden Markov model (HMM) probability computation through use of the feature vector time-series data extracted by the voice feature quantity extraction means 1, and selects and outputs, as a speaker-dependent standard pattern 104, the pattern which has the maximum probability of matching the speaker's voice 101. Reference speaker-dependent standard pattern learning means 7 generates reference speaker-dependent standard patterns λs(1) to λs(M) for reference speaker numbers 1 to M, through use of a reference speaker speech data feature vector 102 and an initial standard pattern 103, which are prepared separately. With the reference speaker-dependent standard patterns λs(1) to λs(M), an adaptive mean vector μai(j,k) is estimated and learned from the speech data regarding a speaker i, with regard to the k-th HMM mean vector μI(j,k) in state "j" of the initial standard pattern 103, by means of a transfer-vector-field smoothing speaker adaptation method (for further information about the method, see Okura et al., "Speaker Adaptation Based on Transfer Vector Field Smoothing Model with Continuous Mixture Density HMMs", IEICE Technical Report, SP92-16, 1992). Reference speaker-group-dependent standard pattern learning means 8 defines the distance among the reference speaker-dependent standard patterns λs(1) to λs(M) produced by the reference speaker-dependent standard pattern learning means 7 by means of a Bhattacharyya distance and clusters the patterns, thereby producing reference speaker-group-dependent standard patterns λg(1) to λg(N) for reference speaker group numbers 1 to N from the reference speaker-dependent standard patterns grouped by means of, e.g., the K-means algorithm (for further information about the algorithm, see L. Rabiner et al., "Fundamentals of Speech Recognition," translated by Kei FURUI, NTT Advanced Technology Company Ltd., 1995). Reference speaker-dependent standard pattern storage means 9 stores the reference speaker-dependent standard patterns λs(1) to λs(M) produced by the reference speaker-dependent standard pattern learning means 7 and the reference speaker-group-dependent standard patterns λg(1) to λg(N) produced by the reference speaker-group-dependent standard pattern learning means 8.
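The Bhattacharyya-distance clustering performed by learning means 8 can be sketched as a K-means-style loop over diagonal-covariance Gaussians. This is an illustrative reconstruction, not the patent's implementation; all function and variable names are hypothetical.

```python
import math

# Illustrative reconstruction of the grouping step: a K-means-style loop
# over speaker-dependent Gaussians (diagonal covariances), using the
# Bhattacharyya distance between Gaussians as the clustering metric.

def bhattacharyya(m1, v1, m2, v2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    d = 0.0
    for a, va, b, vb in zip(m1, v1, m2, v2):
        v = (va + vb) / 2.0
        d += (a - b) ** 2 / (8.0 * v) + 0.5 * math.log(v / math.sqrt(va * vb))
    return d

def kmeans_group(means, covs, n_groups, iters=10):
    """Assign each pattern to its nearest group centroid, then re-estimate
    each centroid as the average mean/variance of its members."""
    centroids = [(list(means[g]), list(covs[g])) for g in range(n_groups)]
    assign = [0] * len(means)
    dim = len(means[0])
    for _ in range(iters):
        for i in range(len(means)):
            assign[i] = min(
                range(n_groups),
                key=lambda g: bhattacharyya(means[i], covs[i], *centroids[g]),
            )
        for g in range(n_groups):
            members = [i for i in range(len(means)) if assign[i] == g]
            if members:
                cm = [sum(means[i][d] for i in members) / len(members)
                      for d in range(dim)]
                cv = [sum(covs[i][d] for i in members) / len(members)
                      for d in range(dim)]
                centroids[g] = (cm, cv)
    return assign

# Four one-dimensional speaker patterns forming two clear clusters:
means = [[0.0], [0.2], [5.0], [5.2]]
covs = [[1.0], [1.0], [1.0], [1.0]]
print(kmeans_group(means, covs, 2))  # [0, 0, 1, 1]
```

With equal variances the log term vanishes and the distance reduces to a scaled squared difference of the means, which is why the two well-separated clusters are recovered.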
The conventional speaker adaptation device adopts a speaker adaptation method based on standard pattern selection. Under this method, a plurality of reference speaker-dependent standard patterns are prepared beforehand through use of a hidden Markov model [HMM; i.e., a speaker-independent standard pattern which is described in detail in, e.g., "Fundamentals of Speech Recognition" and which is prepared beforehand, through standard pattern learning operations, from speech data (such as words or sentences) covering unspecified speakers]. A speaker-dependent standard pattern is then selected on the basis of the characteristics of the speaker's speech.
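The selection step reduces to scoring the input utterance against every stored pattern and keeping the argmax. A minimal sketch, with a single diagonal Gaussian per pattern standing in for the full HMM probability computation; all names are hypothetical.

```python
import math

# Minimal sketch of selection-based adaptation: score the adaptation
# utterance against each stored standard pattern and keep the one with
# the maximum likelihood. A single diagonal Gaussian per pattern stands
# in for the full HMM probability computation.

def log_likelihood(frames, mean, var):
    """Total log-likelihood of feature frames under a diagonal Gaussian."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total

def select_pattern(frames, patterns):
    """patterns: list of (mean, var); returns the index of the best match."""
    return max(range(len(patterns)),
               key=lambda i: log_likelihood(frames, *patterns[i]))

frames = [[0.1], [0.0], [-0.2]]              # adaptation utterance features
patterns = [([0.0], [1.0]), ([3.0], [1.0])]  # candidate standard patterns
print(select_pattern(frames, patterns))      # 0
```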
The reference speaker-group-dependent standard pattern learning means 8 estimates the k-th mean vector μgn(j,k) and a covariance matrix Ugn(j,k) for group "n" in state "j" with regard to the generated reference speaker-group-dependent standard pattern, by means of Equation 1 provided below. Here, μai(j,k) represents the i-th mean vector in the group "n" with regard to the reference speaker-dependent standard pattern, and Uai(j,k) represents the corresponding covariance matrix. Further, I represents the number of reference speaker-dependent standard patterns in the group "n," and "t" denotes transposition.
\[
\mu_{gn}(j,k) = \frac{1}{I}\sum_{i=1}^{I}\mu_{ai}(j,k)
\]
\[
U_{gn}(j,k) = \frac{1}{I}\left(\sum_{i=1}^{I}U_{ai}(j,k)
+ \sum_{i=1}^{I}\mu_{ai}(j,k)\,\mu_{ai}(j,k)^{t}
- I\,\mu_{gn}(j,k)\,\mu_{gn}(j,k)^{t}\right)
\qquad\text{(Eq. 1)}
\]
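Equation 1 pools the group mean as the average of the member means, and the group covariance as the average second moment minus the group-mean outer product. A minimal sketch in plain Python, restricted to diagonal covariances for brevity; function and variable names are illustrative, not from the patent.

```python
# Sketch of the Eq. 1 group statistics, assuming diagonal covariances
# (so outer products reduce to per-dimension squares). Names such as
# group_statistics are illustrative, not from the patent.

def group_statistics(means, covs):
    """Pool I speaker-dependent Gaussians (mean vectors and diagonal
    covariances) into one group Gaussian per Eq. 1."""
    I = len(means)
    dim = len(means[0])
    # mu_gn = (1/I) * sum_i mu_ai
    group_mean = [sum(m[d] for m in means) / I for d in range(dim)]
    # U_gn = (1/I) * (sum_i U_ai + sum_i mu_ai*mu_ai^t - I*mu_gn*mu_gn^t)
    group_cov = [
        (sum(c[d] for c in covs)
         + sum(m[d] * m[d] for m in means)
         - I * group_mean[d] * group_mean[d]) / I
        for d in range(dim)
    ]
    return group_mean, group_cov

means = [[0.0, 2.0], [2.0, 4.0]]
covs = [[1.0, 1.0], [1.0, 1.0]]
mu, U = group_statistics(means, covs)
print(mu)  # [1.0, 3.0]
print(U)   # [2.0, 2.0]: within-group covariance plus mean scatter
```

Note that the pooled variance (2.0) exceeds each member variance (1.0) exactly by the scatter of the member means around the group mean, as Eq. 1 requires.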
The reference speaker-dependent standard pattern storage means 9 uses HMMs with 810 Gaussian distributions per standard pattern, each mean vector having 34 dimensions. For example, with regard to a total of 484 standard patterns (the sum of 279 reference speaker-dependent standard patterns and 205 reference speaker-group-dependent standard patterns), 13,329,360 data values (= 484 × 810 × 34) must be stored for the mean vectors alone.
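The stated storage requirement is simply the product of the three quantities; a quick arithmetic check:

```python
# Storage required for the mean vectors alone, per the figures in the text.
patterns = 279 + 205          # speaker-dependent + group-dependent patterns
gaussians_per_pattern = 810   # Gaussian distributions per HMM standard pattern
dims_per_mean = 34            # mean-vector dimensionality

values = patterns * gaussians_per_pattern * dims_per_mean
print(patterns, values)  # 484 13329360
```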
The speaker's voice 101 corresponds to the voice produced as a result of a speaker using the system speaking predetermined words or sentences beforehand.
The reference speaker speech data feature vector 102 corresponds to a feature vector (e.g., a physical quantity expressing the voice characteristics in a small amount of data, such as a cepstrum or a cepstrum differential) which is extracted by subjecting the voice data of multiple speakers to an acoustic feature quantity analysis. In the case of the number of reference speakers being M, there are feature vector time-series data O(1) to O(M) [O(1) designates time-series signals {o(1,1), o(1,2), . . . , o(1,T1)}, where T1 is the number of speech data frames of a reference speaker 1].
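The time series O(m) is obtained by framing the waveform and reducing each frame to a compact feature vector. A toy sketch, using per-frame log energy as a stand-in for cepstral features; frame and hop sizes are illustrative.

```python
import math

# Toy sketch of producing the feature vector time series O(m) for one
# speaker: slice the waveform into overlapping frames and reduce each
# frame to a small feature vector. Per-frame log energy stands in for
# cepstral features; frame and hop sizes are illustrative.

def extract_features(samples, frame_len=160, hop=80):
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        frames.append([math.log(energy + 1e-10)])  # o(m, t)
    return frames

signal = [math.sin(0.1 * n) for n in range(800)]  # synthetic waveform
O = extract_features(signal)
print(len(O))  # Tm = 9 frames
```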
The initial standard pattern 103 corresponds to an initial standard pattern λI [e.g., a 200-state (5 mixtures/state) phoneme HMM and a 1-state (10 mixtures) silence HMM] prepared beforehand.
For example, as shown in FIG. 8, in the common speech recognition device which uses a conventional speaker adaptation method based on standard pattern selection, the voice feature quantity extraction means 11 operates on a speaker's voice 101a to be recognized (i.e., the voice produced as a result of a speaker using the system speaking words and sentences to be recognized), which is input separately, in the same manner as the voice feature quantity extraction means 1 shown in FIG. 6A. Matching means 12 recognizes speech from the feature vector time-series data produced by the voice feature quantity extraction means 11, by comparing the time-series data with the speaker-dependent standard pattern 104 produced by the speaker adaptation device based on the standard pattern selection method.
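The matching step in means 12 can be sketched as an argmax over word models taken from the selected speaker-dependent standard pattern; a single diagonal Gaussian per word stands in for each word HMM, and all names are hypothetical.

```python
import math

# Hypothetical sketch of the matching step: score the extracted feature
# frames against per-word models from the selected speaker-dependent
# standard pattern and emit the best-matching word. One diagonal
# Gaussian per word stands in for each word HMM.

def frame_score(frames, mean, var):
    """Total log-likelihood of frames under a diagonal Gaussian."""
    s = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            s += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return s

def recognize(frames, word_models):
    """word_models: {word: (mean, var)}; returns the max-likelihood word."""
    return max(word_models, key=lambda w: frame_score(frames, *word_models[w]))

models = {"yes": ([1.0], [0.5]), "no": ([-1.0], [0.5])}
print(recognize([[0.9], [1.1]], models))  # yes
```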
Compared with a speaker adaptation method based on a mapping method [a structural model introduction method regarding personal error, under which a mapping relationship is derived between an initial standard pattern and a speaker's standard pattern by means of a small amount of learned data; e.g., a specific standard pattern learning method which uses a conversion factor obtained by means of a multiple regression mapping model and which is described in "M. J. F. Gales et al., M
Mitsubishi Denki Kabushiki Kaisha