Concatenation of speech segments by use of a speech synthesizer

Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis

Reexamination Certificate

Details

C704S258000, C704S267000

active

06366883

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech synthesizer apparatus, and in particular, to a speech synthesizer apparatus for performing speech synthesis of any arbitrary sequence of phonemes by concatenation of speech segments of speech waveform signals extracted at synthesis time from a natural utterance.
2. Description of the Prior Art
FIG. 2 is a block diagram of a conventional speech synthesizer apparatus.
Referring to FIG. 2, for example, LPC analysis is executed on signal waveform signal data of a speaker for training, and then acoustic feature parameters including 16-degree cepstrum coefficients are extracted. The extracted acoustic feature parameters are temporarily stored in a feature parameter memory 62 of a buffer memory, and then, are transferred from the feature parameter memory 62 to a parameter time sequence generator 52. Next, the parameter time sequence generator 52 executes a signal process, including a time normalization process and a parameter time sequence generation process using prosodic control rules stored in a prosodic rule memory 63, based on the extracted acoustic feature parameters, so as to generate a time sequence of parameters including, for example, the 16-degree cepstrum coefficients, which are required for speech synthesis, and output the generated time sequence thereof to a speech synthesizer 53.
The speech synthesizer 53 is a speech synthesizer apparatus which is already known to those skilled in the art, and comprises a pulse generator 53a for generating voiced speech, a noise generator 53b for generating unvoiced speech, and a filter 53c whose filter coefficient is changeable. The speech synthesizer 53 switches between voiced speech generated by the pulse generator 53a and unvoiced speech generated by the noise generator 53b based on an inputted time sequence of parameters, controls the amplitude of the voiced speech or unvoiced speech, and further changes filter coefficients corresponding to transfer coefficients of the filter 53c. Then, the speech synthesizer 53 generates the synthesized speech signal and outputs it to a loudspeaker 54, and the speech of the speech signal is outputted from the loudspeaker 54.
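The conventional arrangement described above (pulse generator, noise generator, and a filter with changeable coefficients) can be illustrated with a minimal sketch. It assumes a plain all-pole recursive filter driven directly by filter coefficients, rather than the cepstrum-based parameters the text mentions; the function names, the frame dictionary layout, and the frame length are hypothetical and chosen only for illustration.

```python
import random


def excitation(voiced, n_samples, pitch_period, amplitude, rng):
    """Return a pulse train for a voiced frame, white noise for an unvoiced one."""
    if voiced:
        # One pulse every `pitch_period` samples, zeros in between.
        return [amplitude if i % pitch_period == 0 else 0.0
                for i in range(n_samples)]
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def all_pole_filter(x, coeffs, state):
    """Apply y[n] = x[n] - sum_k coeffs[k] * y[n-1-k].

    `state` holds past outputs, most recent first, and is updated in place
    so that filtering continues smoothly across frame boundaries.
    """
    out = []
    for sample in x:
        y = sample - sum(a * past for a, past in zip(coeffs, state))
        state.insert(0, y)
        state.pop()
        out.append(y)
    return out


def synthesize(frames, frame_len=80, seed=0):
    """Synthesize a signal from per-frame parameters (assumed layout).

    Each frame is a dict with keys: voiced, pitch_period, amplitude, coeffs.
    All frames are assumed to use the same filter order.
    """
    rng = random.Random(seed)
    state = [0.0] * len(frames[0]["coeffs"])
    signal = []
    for frame in frames:
        source = excitation(frame["voiced"], frame_len,
                            frame["pitch_period"], frame["amplitude"], rng)
        signal.extend(all_pole_filter(source, frame["coeffs"], state))
    return signal
```

Every output sample here is computed from parameters rather than taken from recorded speech, which is the property the next paragraph identifies as the source of the quality problem.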
However, in the conventional speech synthesizer apparatus, there has been such a problem that the quality of the resulting voice is considerably poor owing to the fact that the signal processing using the prosodic control rules is required, and to the fact that the speech synthesis is performed based on processed acoustic feature parameters.
SUMMARY OF THE INVENTION
An essential object of the present invention is therefore to provide a speech synthesizer apparatus capable of converting any arbitrary phoneme sequence into uttered speech of speech signal without using any prosodic modification rules and without executing any signal processing, and further obtaining a voice quality closer to the natural voice, as compared with that of the conventional apparatus.
In order to achieve the aforementioned objective, according to one aspect of the present invention, there is provided a speech synthesizer apparatus comprising:
first storage means for storing speech segments of speech waveform signals of natural utterance;
speech analyzing means, based on the speech segments of the speech waveform signals stored in said first storage means and a phoneme sequence corresponding to the speech waveform signals, for extracting and outputting index information on each phoneme of the speech waveform signals, first acoustic feature parameters of each phoneme indicated by the index information, and prosodic feature parameters for each phoneme indicated by the index information;
second storage means for storing the index information, the first acoustic feature parameters, and the prosodic feature parameters outputted from said speech analyzing means;
weighting coefficient training means for calculating acoustic distances in second acoustic feature parameters between one target phoneme from the same phonemic kind and the phoneme candidates other than the target phoneme based on the first acoustic feature parameters and the prosodic feature parameters which are stored in said second storage means, and for determining weighting coefficient vectors for respective target phonemes defining degrees of contribution to the second acoustic feature parameters for respective phoneme candidates by executing a predetermined statistical analysis for each of the second acoustic feature parameters for respective phoneme candidates based on the calculated acoustic distances;
third storage means for storing weighting coefficient vectors for the respective target phonemes determined by the weighting coefficient training means;
speech unit selecting means, based on the weighting coefficient vectors for the respective target phonemes stored in said third storage means, and the prosodic feature parameters stored in said second storage means, for searching for a combination of phoneme candidates which correspond to a phoneme sequence of an input sentence and which minimizes a cost including a target cost representing approximate costs between a target phoneme and the phoneme candidates and a concatenation cost representing approximate costs between two phoneme candidates to be adjacently concatenated, and for outputting index information on the searched out combination of phoneme candidates; and
speech synthesizing means for synthesizing and outputting a speech signal corresponding to the input phoneme sequence by sequentially reading out speech segments of speech waveform signals corresponding to the index information from said first storage means based on the index information outputted from said speech unit selecting means, and by concatenating the read-out speech segments of the speech waveform signals.
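The search performed by the speech unit selecting means, which minimizes a sum of target costs and concatenation costs over one candidate per phoneme, can be sketched as a dynamic-programming (Viterbi-style) search. This is only an illustration under stated assumptions: the text does not say which search algorithm is used, and the candidate fields (`index`, `f0`, `dur`), the weights, and both cost functions below are hypothetical.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Pick one candidate per target so the summed cost is minimal.

    targets:    list of target phoneme descriptions
    candidates: candidates[i] is the list of database units for targets[i]
    Returns (total_cost, list of chosen candidate positions).
    """
    # best[i][j] = (cheapest cost of a path ending at candidate j, back pointer)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for cand in candidates[i]:
            path_cost, back = min(
                (best[i - 1][j][0] + concat_cost(prev, cand), j)
                for j, prev in enumerate(candidates[i - 1])
            )
            row.append((path_cost + target_cost(targets[i], cand), back))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(j)
        j = best[i][j][1]
    path.reverse()
    return total, path


# Hypothetical costs: weighted prosodic mismatch, and zero join cost for
# units that were adjacent in the original recording.
WEIGHTS = {"f0": 1.0, "dur": 0.5}


def example_target_cost(target, cand):
    return sum(w * abs(target[k] - cand[k]) for k, w in WEIGHTS.items())


def example_concat_cost(left, right):
    if right["index"] == left["index"] + 1:
        return 0.0
    return float(abs(left["f0"] - right["f0"]))


example_targets = [{"f0": 120, "dur": 80}, {"f0": 130, "dur": 60}]
example_candidates = [
    [{"index": 10, "f0": 118, "dur": 80}, {"index": 40, "f0": 120, "dur": 80}],
    [{"index": 11, "f0": 128, "dur": 60}, {"index": 70, "f0": 130, "dur": 60}],
]
example_total, example_path = select_units(
    example_targets, example_candidates,
    example_target_cost, example_concat_cost,
)
```

In this toy data the second candidate at each position matches its target exactly, yet the search picks the first candidate at both positions (total cost 4.0) because those two units are contiguous in the database and join at no cost. The returned positions stand in for the index information that the speech synthesizing means would use to read out and concatenate waveform segments.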
In the above-mentioned speech synthesizer apparatus, said speech analyzing means may preferably comprise phoneme predicting means for predicting a phoneme sequence corresponding to the speech waveform signals based on input speech waveform signals.
In the above-mentioned speech synthesizer apparatus, said weighting coefficient training means may preferably determine the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a linear regression analysis for each of the second acoustic feature parameters.
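One way to read the linear-regression variant is: keep the N1 candidates acoustically closest to the target, then fit weights that predict the acoustic distance from per-feature differences. The sketch below follows that reading and should be taken as an interpretation, not the patent's stated procedure. It solves ordinary least squares without an intercept through the normal equations; the helper names and the data layout are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def train_weights(feature_diffs, acoustic_dists, top_n):
    """Fit one weight per feature from the top_n acoustically closest candidates.

    feature_diffs[i]:  per-feature differences between candidate i and the target
    acoustic_dists[i]: acoustic distance between candidate i and the target
    """
    ranked = sorted(zip(acoustic_dists, feature_diffs), key=lambda p: p[0])
    ranked = ranked[:top_n]
    ys = [dist for dist, _ in ranked]
    xs = [diffs for _, diffs in ranked]
    k = len(xs[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(row[i] * row[j] for row in xs) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(xs, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)
```

A feature whose fitted weight is large is one whose mismatch tends to accompany a large acoustic distance, which matches the description of the weights as degrees of contribution. The sketch assumes the retained candidates give a non-singular system.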
In the above-mentioned speech synthesizer apparatus, said weighting coefficient training means may preferably determine the weighting coefficient vectors for the respective target phonemes representing the degrees of contribution to the second acoustic feature parameters for the respective phoneme candidates, by extracting a plurality of best top N1 phoneme candidates based on the calculated acoustic distances, and by executing a statistical analysis using a predetermined neural network for each of the second acoustic feature parameters.
In the above-mentioned speech synthesizer apparatus, said speech unit selecting means may preferably extract a plurality of top N2 phoneme candidates that are best in terms of the cost including the target cost and the concatenation cost, and thereafter, search for a combination of phoneme candidates that minimizes the cost.
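The pruning step can be pictured as shortlisting a few candidates per phoneme before the full combination search. The sketch below is a simplification: it ranks candidates by a target cost alone, whereas the text describes ranking by a cost that also includes the concatenation cost. The function name and the example cost are hypothetical.

```python
import heapq


def prune_candidates(target, candidates, target_cost, n_best):
    """Return the n_best candidates with the lowest target cost, cheapest first."""
    return heapq.nsmallest(n_best, candidates,
                           key=lambda cand: target_cost(target, cand))


def f0_mismatch(target, cand):
    """Hypothetical cost: absolute difference in fundamental frequency."""
    return abs(target["f0"] - cand["f0"])


shortlist = prune_candidates(
    {"f0": 120},
    [{"f0": 150}, {"f0": 119}, {"f0": 90}, {"f0": 124}],
    f0_mismatch,
    n_best=2,
)
```

Shrinking each candidate list this way bounds the work of the subsequent search, since the number of candidate pairs examined between adjacent phonemes grows with the product of the two list sizes.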
In the above-mentioned speech synthesizer apparatus, the first acoustic feature parameters may preferably include cepstrum coefficients, delta cepstrum coefficients and phoneme labels.
In the above-mentioned speech synthesizer apparatus, the first acoustic feature parameters may preferably include formant parameters and voice source parameters.
In the above-mentioned speech synthesizer apparatus, the prosodic feature parameters may preferably include phoneme durations, speech fundamental frequencies F0, and powers.
In the above-mentioned speech synthesi
