Speech synthesizing method and apparatus

Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis

Reexamination Certificate

Details

C704S267000, C704S268000

active

06832192

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to a speech synthesizing method and apparatus and, more particularly, to power control on synthesized speech in a speech synthesizing process.
BACKGROUND OF THE INVENTION
As a speech synthesizing method of obtaining desired synthesized speech, a method of generating synthesized speech by editing and concatenating speech segments in units of phonemes or CV/VC, VCV (C: Consonant; V: vowel), and the like is known.
FIGS. 10A to 10D are views for explaining CV/VC and VCV as speech segment units. As shown in FIGS. 10A to 10D, CV/VC is a unit with a speech segment boundary set in each phoneme, and VCV is a unit with a speech segment boundary set in a vowel.
FIGS. 11A to 11D are views schematically showing an example of a method of changing the duration length and fundamental frequency of one speech segment. As shown in FIG. 11C, a speech waveform 1101 of one speech segment shown in FIG. 11A is divided into a plurality of small speech segments 1103 by a plurality of window functions 1102 in FIG. 11B. In this case, for a voiced sound portion (a voiced sound region in the second half of a speech waveform), a window function having a time width synchronous with the pitch of the original speech is used. For an unvoiced sound portion (an unvoiced sound region in the first half of the speech waveform), a window function having an appropriate time width (longer than that for a voiced sound portion) is used.
By repeating a plurality of small speech segments obtained in this manner, thinning out some of them, and changing the intervals, the duration length and fundamental frequency of synthesized speech 1104 can be changed as shown in FIG. 11D. For example, the duration length of synthesized speech can be reduced by thinning out small speech segments, and can be increased by repeating small speech segments. The fundamental frequency of synthesized speech can be increased by reducing the intervals between small speech segments of a voiced sound portion, and can be decreased by increasing the intervals between the small speech segments. By superimposing a plurality of small speech segments obtained by such repetition, thinning out, and interval changes, synthesized speech having a desired duration length and fundamental frequency can be obtained.
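The repetition, thinning, and re-spacing described above can be sketched in a few lines of Python. This is a minimal illustration for a voiced portion only, under several assumptions that are not taken from the text: a Hann window stands in for the window functions, the pitch marks are given as sample indices, each window spans one pitch period on either side of its mark, and the function names (`hann`, `cut_grains`, `pick_grains`, `overlap_add`) are hypothetical.

```python
import math

def hann(n):
    """Hann window of length n (an assumed stand-in for the window functions)."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def cut_grains(wave, marks, half_width):
    """Cut one windowed small speech segment centered on each pitch mark."""
    win = hann(2 * half_width + 1)
    grains = []
    for m in marks:
        grain = []
        for k in range(-half_width, half_width + 1):
            idx = m + k
            sample = wave[idx] if 0 <= idx < len(wave) else 0.0
            grain.append(sample * win[k + half_width])
        grains.append(grain)
    return grains

def pick_grains(n_in, n_out):
    """Choose n_out grain indices out of n_in: repeats if n_out > n_in, thins if smaller."""
    return [(j * n_in) // n_out for j in range(n_out)]

def overlap_add(grains, spacing):
    """Superimpose grains placed `spacing` samples apart."""
    out = [0.0] * (spacing * (len(grains) - 1) + len(grains[0]))
    for j, grain in enumerate(grains):
        start = j * spacing
        for k, value in enumerate(grain):
            out[start + k] += value
    return out

# Toy voiced waveform with a pitch period of 40 samples.
wave = [math.sin(2 * math.pi * n / 40) for n in range(400)]
grains = cut_grains(wave, list(range(40, 400, 40)), half_width=40)

# Longer duration: repeat grains, keep the original spacing.
longer = overlap_add([grains[i] for i in pick_grains(len(grains), 2 * len(grains))], 40)
# Higher fundamental frequency: same grains, smaller spacing.
higher = overlap_add(grains, 30)
```

Duration depends on how many grains are placed, and fundamental frequency on the spacing between them, so the two can be adjusted independently in this sketch.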
Power control for such synthesized speech can be performed as follows. An estimated value $p_0$ of the average power of speech segments (corresponding to a target average power) and an average power $p$ of the synthesized speech obtained by the above procedure are obtained. Multiplying the synthesized speech obtained by the above procedure by $(p_0/p)^{1/2}$ then yields synthesized speech having the desired average power. That is, power control is executed in units of speech segments.
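Segment-level power control of this kind can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, average power is assumed to be the mean of the squared samples, and the gain is taken as the square root of target power over measured power, which is the factor that brings the measured average power to the target.

```python
import math

def average_power(samples):
    """Mean of squared sample values (assumed definition of average power)."""
    return sum(x * x for x in samples) / len(samples)

def scale_to_target_power(samples, target_power):
    """Apply one gain to the whole segment so its average power equals the target."""
    measured = average_power(samples)
    if measured == 0.0:
        return list(samples)  # silence cannot be scaled to a nonzero power
    gain = math.sqrt(target_power / measured)
    return [gain * x for x in samples]

segment = [1.0, -1.0, 2.0, -2.0]          # measured average power 2.5
scaled = scale_to_target_power(segment, 10.0)
```

Because a single gain is applied to every sample, the relative power of different regions inside the segment is left unchanged.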
The above power control method suffers the following problems.
The first problem is associated with mismatching between a power control unit and a speech segment unit.
To perform stable power control, power control must be performed in units of time periods of a certain length. In addition, the power variation within a power control unit needs to be small. A phoneme or the like may be used as a unit that satisfies these conditions. However, a unit such as CV/VC or VCV contains, within each speech segment, a phoneme boundary at which the power varies greatly, and hence the power variation in each speech segment is large. Such a unit is therefore not suitable as a power control unit.
A voiced sound portion differs greatly in power from an unvoiced sound portion. Basically, since whether a sound is voiced or unvoiced can be uniquely determined from the phoneme type, this difference poses no problem if the average power value is estimated for each type of phoneme. A close examination, however, reveals that there are exceptions to the relationship between phoneme types and voiced/unvoiced sounds, so mismatching may occur. In addition, a phoneme boundary may differ from a voiced/unvoiced sound boundary by several msec to ten-odd msec. This is because a phoneme type and a phoneme boundary are mainly determined by the vocal tract shape, whereas whether a sound is voiced or unvoiced is determined by the presence/absence of vocal cord vibrations.
SUMMARY OF THE INVENTION
The present invention has been made in consideration of the above problems, and has as its object to perform proper power control even if a phoneme unit with power greatly varying within a speech segment is set as a unit for waveform editing.
In order to achieve the above object, according to the present invention, there is provided a speech synthesizing method comprising the division step of acquiring partial speech segments by dividing a speech segment in a predetermined unit with a phoneme boundary, the estimation step of estimating a power value of each partial speech segment obtained in the division step on the basis of a target power value, the changing step of changing the power value of each of the partial speech segments on the basis of the power value estimated in the estimation step, and the generating step of generating synthesized speech by using the partial speech segments changed in the changing step.
In order to achieve the above object, according to the present invention, there is provided a speech synthesizing apparatus comprising division means for acquiring partial speech segments by dividing a speech segment in a predetermined unit with a phoneme boundary, estimation means for estimating a power value of each partial speech segment obtained by the division means on the basis of a target power value, changing means for changing the power value of each of the partial speech segments on the basis of the power value estimated by the estimation means, and generating means for generating synthesized speech by using the partial speech segments changed by the changing means.
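The four steps (division, estimation, changing, generating) can be lined up in a short sketch. Everything named here is hypothetical and simplified: the phoneme boundary is assumed to be a known sample index, the estimator is passed in as a callback because the text at this point does not say how the estimate is computed, and the generating step is reduced to plain concatenation rather than full waveform editing.

```python
import math

def average_power(samples):
    """Mean of squared sample values (assumed definition of average power)."""
    return sum(x * x for x in samples) / len(samples)

def divide_at_boundary(segment, boundary):
    """Division step: split one speech segment at its phoneme boundary index."""
    return [segment[:boundary], segment[boundary:]]

def change_power(part, target_power):
    """Changing step: scale one partial segment to its own estimated power."""
    current = average_power(part)
    if current == 0.0:
        return list(part)
    gain = math.sqrt(target_power / current)
    return [gain * x for x in part]

def synthesize(segment, boundary, estimate_power):
    """Run the four steps; estimate_power(index, part) is a hypothetical callback."""
    parts = divide_at_boundary(segment, boundary)
    targets = [estimate_power(i, part) for i, part in enumerate(parts)]  # estimation step
    changed = [change_power(part, t) for part, t in zip(parts, targets)]
    output = []
    for part in changed:  # generating step, simplified to concatenation
        output.extend(part)
    return output

# A quiet first half followed by a loud second half, split at sample 4.
segment = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
invented_targets = [0.04, 0.25]  # illustrative values only
speech = synthesize(segment, 4, lambda i, part: invented_targets[i])
```

Each side of the phoneme boundary receives its own gain, so a large power difference across the boundary no longer forces a single compromise value for the whole segment.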
Preferably, in changing the power value of each of the partial speech segments, for each of the partial speech segments, a corresponding reference power value is acquired, an amplitude change magnification is calculated on the basis of the power value estimated in the estimation step and the acquired reference power value, and a change to the estimated power value is made by changing an amplitude of the partial speech segment in accordance with the calculated amplitude change magnification. More specifically, an amplitude value of the partial speech segment is changed by using, as an amplitude change magnification, s being obtained by
$$s = (p/q)^{1/2}$$
where p is the power value estimated in the estimation step, and q is the acquired reference power value.
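The square root in the magnification follows from how average power scales with amplitude. As a brief check, write $x_n$ for the $N$ samples of a partial speech segment (notation introduced here, not taken from the text) and assume that the reference power value $q$ is that segment's mean squared amplitude:

$$
q = \frac{1}{N}\sum_{n=1}^{N} x_n^2,
\qquad
\frac{1}{N}\sum_{n=1}^{N} (s\,x_n)^2 = s^2 q = p
\;\Longrightarrow\;
s = (p/q)^{1/2}
$$

For instance, if $p = 4q$ then $s = 2$: doubling every sample quadruples the average power.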
Preferably, in estimating the power of each partial speech segment, whether each of the partial speech segments is a voiced or unvoiced sound is determined, and if it is determined that the partial speech segment is a voiced sound, a power value is estimated by using a parameter value for a voiced speech segment, and if it is determined that the speech segment is an unvoiced sound, a power value is estimated by using a parameter value of an unvoiced speech segment. Since parameter values suited for voiced and unvoiced sounds are used, power control can be performed more properly.
Preferably, in estimating the power value of each partial speech segment, a power estimation factor for each of the partial speech segments is acquired, and a parameter value corresponding to the acquired power estimation factor is acquired in accordance with the determination result on a voiced/unvoiced sound to estimate the power value. Preferably, the power estimation factor includes one of a phoneme type of the partial speech segment, a mora position of a synthesis target word of the partial speech segment, a mora count of the synthesis target word, and an accent type.
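The selection of parameter values by voiced/unvoiced determination and by power estimation factor can be sketched as a table lookup. The text here does not say how the parameter values are combined, so this sketch assumes, purely for illustration, an additive model: a baseline plus one coefficient per factor. The table layout, the function name `estimate_power`, and every number are invented.

```python
# Hypothetical parameter tables: all values are invented for illustration.
VOICED_PARAMS = {
    "baseline": 60.0,
    "coefficients": {
        "phoneme": {"a": 3.0, "i": 1.5},
        "mora_position": {1: 0.5, 2: 0.0},
        "mora_count": {2: 0.25, 3: 0.0},
        "accent_type": {0: 0.0, 1: 0.25},
    },
}
UNVOICED_PARAMS = {
    "baseline": 40.0,
    "coefficients": {
        "phoneme": {"s": 2.0, "k": 1.0},
        "mora_position": {1: 0.25, 2: 0.0},
    },
}

def estimate_power(factors, voiced):
    """Estimate a power value (arbitrary units) for one partial speech segment.

    factors: dict mapping a factor name to its observed value.
    voiced:  result of the voiced/unvoiced determination.
    """
    table = VOICED_PARAMS if voiced else UNVOICED_PARAMS
    power = table["baseline"]
    for name, value in factors.items():
        # Factors or values missing from the chosen table contribute nothing.
        power += table["coefficients"].get(name, {}).get(value, 0.0)
    return power

voiced_estimate = estimate_power(
    {"phoneme": "a", "mora_position": 1, "mora_count": 3, "accent_type": 1},
    voiced=True,
)
unvoiced_estimate = estimate_power({"phoneme": "s", "mora_position": 1}, voiced=False)
```

Keeping two separate tables means the voiced and unvoiced cases can use different parameter values, and even different sets of factors, without affecting each other.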
Preferably, a power estimation factor for a voiced sound is acquired if it is determined that the partial speech segment is a voiced sound, and a power estimation factor for an unvoiced sound is acquired if it is determined that the partial speech segment is an unvoiced sound. Since differe
