Speaker normalization processor apparatus for generating...

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate

Details

C704S234000, C704S256000

active

06236963

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speaker normalization processor apparatus and a speech recognition apparatus together with the speaker normalization apparatus, and in particular, to a speaker normalization processor apparatus for generating a speaker-normalized optimal hidden Markov model (hereinafter, a hidden Markov model will be referred to as an HMM) based on speaker-normalizing speech waveform data of a plurality of training speakers, using a function for normalizing input frequencies to be directed to average Formant frequencies and by then training an initial HMM based on the speaker-normalized speech waveform data, and also relates to a speech recognition apparatus for performing speech recognition by using the generated HMM.
2. Description of the Prior Art
Conventionally, as a technique for speaker normalization, a speaker normalization technique using frequency warping with attention focused on vocal tract length (hereinafter, referred to as a prior art example) has been proposed, and its effectiveness has been reported (See, for example, Prior Art Document 1, P. Zhan et al., “Speaker Normalization Based on Frequency Warping”, Proceeding of ICASSP, pp. 1039-1042, 1997). The speaker normalization technique based on the likelihood in this prior art example is a method comprising the steps of, using a plurality of frequency warping functions prepared in advance, performing frequency warping using these functions and then acoustic analysis, determining resultant likelihoods at which acoustic parameters are outputted from an initial acoustic model, and selecting the warping function having the highest likelihood. Hereinbelow, the method of selecting an optimal frequency warping function based on the likelihood as well as the procedure for speaker normalization training are explained.
First of all, the method of selecting a frequency warping function will be explained. In this case, as shown in FIG. 17, a frequency warping function optimal to each speaker is selected from a plurality of N frequency warping functions f_1, f_2, ..., f_N ∈ F according to the following procedure:
(A1) Feature extractors 31-1 to 31-N perform a frequency warping process on the speech waveform data of one speaker m, using the frequency warping functions f_1, f_2, ..., f_N ∈ F prepared in advance, and then perform acoustic analysis;
(A2) A likelihood calculator 32 determines a likelihood by Viterbi search using the correct-solution phoneme series, with reference to a predetermined phoneme HMM 33, for each of the acoustic analysis results obtained in (A1) above;
(A3) A maximum likelihood selector 34 selects the frequency warping function f_max that gives the maximum likelihood among the frequency warping functions f_1, f_2, ..., f_N, based on the results of (A2) above; and
(A4) A feature extractor 35 performs a frequency warping process on the inputted speech waveform data of the speaker m using the frequency warping function f_max, and then performs acoustic analysis, thereby outputting normalized feature parameters. These feature parameters are used, for example, for speech recognition.
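Steps (A1) to (A3) amount to an argmax over candidate warping functions. The following is a minimal Python sketch of that selection only, not the apparatus of FIG. 17. The names select_warp, toy_extract, toy_log_likelihood and REFERENCE are hypothetical, and the toy likelihood (negative squared distance to reference values) merely stands in for the Viterbi score against the phoneme HMM 33.

```python
from typing import Callable, List, Sequence, Tuple

Warp = Callable[[float], float]


def select_warp(
    speech: Sequence[float],
    warps: Sequence[Warp],
    extract_features: Callable[[Sequence[float], Warp], List[float]],
    log_likelihood: Callable[[List[float]], float],
) -> Tuple[int, float]:
    """Return (index, score) of the candidate warp with the highest likelihood."""
    best_index, best_score = -1, float("-inf")
    for i, warp in enumerate(warps):
        features = extract_features(speech, warp)  # (A1) warp, then analyze
        score = log_likelihood(features)           # (A2) score against the model
        if score > best_score:                     # (A3) keep the maximum
            best_index, best_score = i, score
    return best_index, best_score


# Toy stand-ins (assumptions for illustration, not the patent's components).
REFERENCE = [0.22, 0.44]  # hypothetical "model" values on a normalized axis


def toy_extract(speech: Sequence[float], warp: Warp) -> List[float]:
    # Stand-in for frequency warping followed by acoustic analysis.
    return [warp(f) for f in speech]


def toy_log_likelihood(features: List[float]) -> float:
    # Stand-in for the HMM likelihood: higher when closer to REFERENCE.
    return -sum((x - r) ** 2 for x, r in zip(features, REFERENCE))


# Three hypothetical linear candidates f' = a * f.
candidate_warps = [lambda f, a=a: a * f for a in (0.9, 1.0, 1.1)]
best_index, best_score = select_warp(
    [0.2, 0.4], candidate_warps, toy_extract, toy_log_likelihood
)
print(best_index, best_score)
```

Step (A4) would then re-run the feature extraction with the selected function, for example toy_extract(speech, candidate_warps[best_index]).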
Next, the procedure for speaker normalization training will be explained. It is assumed here that, for the training, two different speech data sets, speech data for the selection of a frequency warping function and speech data for training, are used.
(B1) Acoustic analysis of speech waveform data for adaptation or training of all the training speakers is performed, by which acoustic feature parameters are obtained. For these acoustic feature parameters, mel-frequency cepstrum coefficients or the like, which are known to those skilled in the art, are used;
(B2) The frequency warping function f_max that gives a maximum likelihood on the speech data for the selection of a frequency warping function of each training speaker is selected based on a trained acoustic model Λ_i;
(B3) Frequency warping using the frequency warping function selected for each speaker, and then, acoustic analysis of the speech data for training, are performed, by which the acoustic feature parameters are determined;
(B4) The acoustic model Λ_i is trained based on the acoustic analysis results obtained in (B3) above; and
(B5) Then, the process of (B2)-(B4) is repeated a designated number of times.
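The alternation in (B2) to (B5) between selecting a warp per speaker and re-training the model can be sketched as a short loop. This is only an illustration under strong simplifying assumptions: the acoustic model is reduced to a single mean value, warping is reduced to scaling by a coefficient, and the likelihood is a negative squared distance. All function and variable names below are hypothetical, and the initial model from (B1) is simply passed in as a number.

```python
from statistics import mean


def log_likelihood(features, model_mean):
    # Toy stand-in for an HMM score: higher when features lie near the model mean.
    return -sum((x - model_mean) ** 2 for x in features)


def warp_features(values, alpha):
    # Toy stand-in for "frequency warping followed by acoustic analysis".
    return [alpha * v for v in values]


def train_normalized_model(speakers, alphas, model_mean, iterations):
    """Alternate warp selection and model re-estimation for a fixed number of passes."""
    chosen = {}
    for _ in range(iterations):                              # (B5) repeat
        for name, data in speakers.items():                  # (B2) select per speaker
            chosen[name] = max(
                alphas,
                key=lambda a: log_likelihood(
                    warp_features(data["selection"], a), model_mean
                ),
            )
        pooled = [                                           # (B3) warp training data
            x
            for name, data in speakers.items()
            for x in warp_features(data["training"], chosen[name])
        ]
        model_mean = mean(pooled)                            # (B4) re-train the model
    return model_mean, chosen


# Each speaker has separate data for warp selection and for training,
# mirroring the two data sets assumed in the text. Values are made up.
speakers = {
    "a": {"selection": [0.8], "training": [0.8, 0.8]},
    "b": {"selection": [1.0], "training": [1.0]},
    "c": {"selection": [1.25], "training": [1.25]},
}
final_mean, chosen = train_normalized_model(
    speakers, alphas=[0.8, 1.0, 1.25], model_mean=1.0, iterations=2
)
print(final_mean, chosen)
```

In this toy setting each speaker ends up with the coefficient that maps their data onto the shared model value, which is the intended effect of the normalization.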
FIG. 18 is a graph showing examples of frequency warping functions in the prior art example. The function shown in FIG. 18 represents the correspondence between frequencies before and after performing the frequency warping by a linear frequency warping function determined by a frequency warping coefficient α. With a coefficient φ determined, if the normalized frequency f of input speech is not more than φ, the frequency warping function is given by the following equation:
f′ = α·f for 0 < f ≤ φ  (1),
and when the frequency f of input speech is within a range of φ to one, the frequency warping function is given by the following line that interconnects coordinates (φ, α·φ) and coordinates (1.0, 1.0) shown in FIG. 18:

f′ = {(α·φ − 1)·f − (α − 1)·φ}/(φ − 1) for φ < f ≤ 1.0.  (2)
For the execution of speaker normalization, a plurality of frequency warping functions that differ from one another in the frequency warping coefficient α are prepared, and among those, the frequency warping function having the maximum likelihood is selected. The term “frequency warping” herein refers to a process of shifting each frequency of the speech waveform data of one target speaker to its corresponding average frequency of all the speakers by using, for example, the frequency warping functions of FIG. 18.
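The two-segment function of Equations (1) and (2) can be written directly in code. The following Python sketch assumes normalized frequencies in the range 0 to 1. The function name warp_frequency and the sample values of α and φ are hypothetical and are not taken from the prior art document.

```python
def warp_frequency(f: float, alpha: float, phi: float) -> float:
    """Piecewise-linear warp of a normalized frequency f in [0, 1].

    Below phi the slope is alpha (Equation (1)); above phi the function is
    the straight line joining (phi, alpha * phi) to (1.0, 1.0) (Equation (2)).
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a normalized frequency in [0, 1]")
    if not 0.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between 0 and 1")
    if f <= phi:
        return alpha * f                                    # Equation (1)
    return ((alpha * phi - 1.0) * f - (alpha - 1.0) * phi) / (phi - 1.0)  # Equation (2)


# A discrete set of candidate coefficients, as in the prior art example.
# These particular values are illustrative assumptions.
candidate_alphas = [0.9, 1.0, 1.1]
phi = 0.8
for alpha in candidate_alphas:
    print(alpha, [warp_frequency(f, alpha, phi) for f in (0.4, 0.8, 0.9, 1.0)])
```

Both segments meet at f = φ, where each gives α·φ, and the upper segment always ends at (1.0, 1.0), so the warped axis keeps the same overall range for every α.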
However, for the method of the prior art example, it is necessary to previously specify the configuration of the frequency warping function. Also, since the frequency warping coefficient α is given as a discrete value, there has been a problem that detailed frequency warping functions could not be estimated. Further, when speech recognition is performed using an HMM speaker-normalized and trained by the speaker normalization method of the prior art example, there has been a problem that significant improvement in the speech recognition rate by normalization could not be obtained.
SUMMARY OF THE INVENTION
An essential object of the present invention is to provide a speaker normalization processor apparatus capable of generating an acoustic model of high recognition performance by estimating a frequency warping function from target speaker to standard speaker at higher accuracy as compared with the prior art example, and by performing speaker normalization and training using the estimated frequency warping function.
Another object of the present invention is to provide a speech recognition apparatus capable of accomplishing speech recognition at a higher speech recognition rate as compared with the prior art example by using an HMM generated by the speaker normalization processor.
In order to achieve the above-mentioned objective, according to one aspect of the present invention, there is provided a speaker normalization processor apparatus comprising:
a first storage unit for storing speech waveform data of a plurality of normalization-target speakers and text data corresponding to the speech waveform data;
a second storage unit for storing Formant frequencies of a standard speaker determined based on a vocal-tract area function of the standard speaker;
estimation means for estimating feature quantities of a vocal-tract configuration showing an anatomical configuration of a vocal tract of each normalization-target speaker, by looking up to a correspondence between vocal-tract configuration parameters and Formant frequencies previously determined based on a vocal tract model of the standard speaker, based on the speech waveform data of each normalization-target speaker stored in said first storage unit;
function generating me

