Speech synthesis apparatus

Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis

Reexamination Certificate


Details

C704S267000, C704S268000

active

06499014

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech synthesis apparatus that synthesizes speech by rule, and in particular to a speech synthesis apparatus with improved control of the pitch contour of synthesized speech in a text-to-speech conversion technique that takes a mixed sentence of Chinese characters (Kanji) and Japanese syllabary (Kana), as used in daily reading and writing, and outputs it as speech.
2. Description of the Related Art
In text-to-speech conversion, Kanji and Kana characters used in daily reading and writing are input and converted into speech output. Because this technique places no limitation on the vocabulary that can be output, it is expected to be applied in various technical fields as an alternative to recording-and-playback speech synthesis.
When Kanji and Kana characters (hereinafter, referred to as a text) are input to a conventional speech synthesis apparatus, a text analysis module included therein generates a string of phonetic and prosodic symbols (hereinafter, referred to as an intermediate language) from the character information. The intermediate language describes, as a character string, how to read the input sentence, its accents, its intonation and the like. A prosody generation module then determines synthesizing parameters from the intermediate language generated by the text analysis module. The synthesizing parameters include a phoneme pattern, the duration of each phoneme, a fundamental frequency (the pitch of the voice, hereinafter simply referred to as pitch) and the like. The determined synthesizing parameters are output to a speech generation module. The speech generation module generates a synthesized waveform from the synthesizing parameters determined in the prosody generation module and from a voice segment dictionary in which phonemes are accumulated, and then outputs the synthesized sound through a speaker.
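As a rough sketch of this data flow (not the patented implementation), the following Python fragment composes three stage functions in the order described above; the names make_text_to_speech, analyze_text, generate_prosody and generate_waveform are hypothetical placeholders, and the voice segment dictionary is assumed to be bound inside the waveform-generation callable.

```python
# Minimal sketch of the conventional text-to-speech pipeline described above.
# All names here are illustrative placeholders, not the patented implementation.
from typing import Any, Callable


def make_text_to_speech(
    analyze_text: Callable[[str], str],        # text analysis module
    generate_prosody: Callable[[str], Any],    # prosody generation module
    generate_waveform: Callable[[Any], bytes], # speech generation module (holds the
                                               # voice segment dictionary internally)
) -> Callable[[str], bytes]:
    """Compose the three stages of the conventional pipeline."""

    def text_to_speech(text: str) -> bytes:
        intermediate = analyze_text(text)            # Kanji/Kana text -> intermediate language
        parameters = generate_prosody(intermediate)  # -> pitch contour, durations, power
        return generate_waveform(parameters)         # parameters -> synthesized waveform
    return text_to_speech
```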
Next, a conventional process conducted by the prosody generation module is described in detail. The conventional prosody generation module includes an intermediate language analysis module, a phrase command determination module, an accent command determination module, a phoneme duration calculation module, a phoneme power determination module and a pitch contour generation module.
The intermediate language input to the prosody generation module is a string of phonetic characters annotated with the positions of accents, pauses and the like. From this string, the parameters required for generating a waveform (hereinafter, referred to as waveform-generating parameters) are determined, such as the time-variant change of the pitch (hereinafter, referred to as a pitch contour), the duration of each phoneme (hereinafter, referred to as the phoneme duration), and the power of the speech. The intermediate language analysis module first analyzes the input character string. In this analysis, word boundaries are determined based on a symbol indicating a word's end in the intermediate language, and the mora position of an accent nucleus is obtained based on an accent symbol.
The accent nucleus is the position at which the accent falls. A word whose accent nucleus is positioned at the first mora is referred to as a word of accent type one, while a word whose accent nucleus is positioned at the n-th mora is referred to as a word of accent type n. Such words are referred to as accented words. On the other hand, a word having no accent nucleus (for example, “shin-bun” and “pasokon”, which mean a newspaper and a personal computer in Japanese, respectively) is referred to as a word of accent type zero or an unaccented word.
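To make this analysis concrete, the following Python sketch parses a hypothetical intermediate-language notation in which moras are separated by spaces, an apostrophe after a mora marks the accent nucleus, and "/" marks a word boundary. The actual symbol set of the intermediate language is not given in this excerpt, so this notation and the name analyze_intermediate_language are assumptions.

```python
# Hedged sketch of intermediate-language analysis under an assumed notation.
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    moras: List[str]
    accent_type: int  # 0 = unaccented, n = accent nucleus on the n-th mora


def analyze_intermediate_language(intermediate: str) -> List[Word]:
    words = []
    for chunk in intermediate.split("/"):   # "/" marks a word boundary
        chunk = chunk.strip()
        if not chunk:
            continue
        moras, accent_type = [], 0
        for token in chunk.split():
            if token.endswith("'"):          # "'" marks the accent nucleus
                moras.append(token[:-1])
                accent_type = len(moras)     # mora position of the nucleus
            else:
                moras.append(token)
        words.append(Word(moras=moras, accent_type=accent_type))
    return words


# "ha'shi" (chopsticks) is accent type one; "shin-bun" (newspaper) is unaccented.
print(analyze_intermediate_language("ha' shi / shi n bu n"))
```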
The phrase command determination module and the accent command determination module determine parameters for response functions described later, based on a phrase symbol, an accent symbol and the like in the intermediate language. In addition, if a user sets intonation (the magnitude of the intonation), the magnitude of the phrase command and that of the accent command are modified in accordance with the user's setting.
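As a minimal illustration of how such a user setting might enter the computation, the sketch below scales the phrase-command and accent-command magnitudes by a single multiplicative intonation factor. The patent text only states that the magnitudes are modified in accordance with the setting, so the multiplicative rule and the name scale_commands are assumptions.

```python
# Hedged sketch: one simple way a user "intonation" setting could modify the
# phrase-command and accent-command magnitudes (multiplicative scaling assumed).
def scale_commands(phrase_magnitudes, accent_magnitudes, intonation=1.0):
    Ap = [a * intonation for a in phrase_magnitudes]  # scaled phrase command magnitudes
    Aa = [a * intonation for a in accent_magnitudes]  # scaled accent command magnitudes
    return Ap, Aa
```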
The phoneme duration calculation module determines the duration of each phoneme from the phonetic character string and sends the calculation result to the speech generation module. The phoneme duration is calculated using rules or a statistical analysis such as Quantification theory (type one), depending on the types of adjacent phonemes. Quantification theory (type one) is a kind of factor analysis that formulates the relationship between categorical and numerical values. In addition, when the user sets a speech rate, the phoneme duration calculation module takes that rate into account: the phoneme duration normally becomes longer when the speech rate is made slower and shorter when the speech rate is made faster.
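The following Python sketch illustrates the speech-rate dependence described above with a simple per-phoneme duration lookup. The table values, the division by the speech rate, and the name phoneme_durations are illustrative assumptions standing in for the rules or Quantification theory (type one) model of the actual module.

```python
# Hedged sketch of phoneme-duration calculation with speech-rate scaling.
# The base durations below are illustrative, not values from the patent.
BASE_DURATION_MS = {"a": 90.0, "i": 75.0, "k": 60.0, "s": 85.0, "N": 80.0}


def phoneme_durations(phonemes, speech_rate=1.0):
    """Return per-phoneme durations in milliseconds.

    speech_rate > 1.0 means faster speech (shorter durations);
    speech_rate < 1.0 means slower speech (longer durations).
    """
    return [BASE_DURATION_MS.get(p, 80.0) / speech_rate for p in phonemes]


print(phoneme_durations(["k", "a", "N"], speech_rate=0.8))  # slower -> longer
```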
The phoneme power determination module calculates the amplitude value of the waveform and sends the calculated value to the speech generation module. The phoneme power describes the power transition over a period corresponding to the rising portion of the phoneme, in which the amplitude gradually increases, a period corresponding to a steady state, and a period corresponding to the falling portion of the phoneme, in which the amplitude gradually decreases; it is calculated based on coefficient values held in the form of a table.
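A minimal sketch of such a three-part power transition is shown below, using linear rise and fall segments around a steady portion. The segment lengths and the linear shape are assumptions, since the patent only states that the transition is computed from coefficient values held in a table.

```python
# Hedged sketch of a phoneme power envelope: rising, steady, and falling portions.
def power_envelope(duration_ms, peak, rise_ms=20.0, fall_ms=30.0, step_ms=5.0):
    """Return (time_ms, amplitude) points for one phoneme (illustrative shape)."""
    points = []
    t = 0.0
    while t <= duration_ms:
        if t < rise_ms:                      # rising portion: amplitude increases
            amp = peak * (t / rise_ms)
        elif t > duration_ms - fall_ms:      # falling portion: amplitude decreases
            amp = peak * (duration_ms - t) / fall_ms
        else:                                # steady portion
            amp = peak
        points.append((t, amp))
        t += step_ms
    return points


print(power_envelope(duration_ms=100.0, peak=1.0))
```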
These waveform generating parameters are sent to the speech generation module. Then, the synthesized waveform is generated.
Next, a procedure for generating a pitch contour in the pitch contour generation module is described.
FIG. 14 is a diagram explaining the generation procedure of the pitch contour and illustrates a model of a pitch control mechanism.
In order to sufficiently represent differences in intonation between various sentences, it is necessary to clarify the relationship between pitch and time within a syllable. The “pitch control mechanism model”, described by a critically damped second-order linear system, is used as a model that can clearly describe the pitch contour within a syllable and can define its time-variant structure. The pitch control mechanism model referred to in the present specification is the model explained below.
The pitch control mechanism model is a model that is considered to generate the fundamental frequency carrying information about the voice pitch. The frequency of vibration of the vocal cords, that is, the fundamental frequency, is controlled by an impulse command generated at every change of phrase and a stepwise command generated at every rise and fall of an accent. Because of the delay characteristics of the physiological mechanisms, the impulse command of the phrase yields a curve (phrase component) gradually descending from the front of a sentence to its end (see the waveform indicated by a broken line in FIG. 14), while the stepwise command of the accent yields a curve (accent component) with local ups and downs (indicated by the waveform with a solid line in FIG. 14). Each of these two components is modeled as the response of a critically damped second-order linear system to the corresponding command. The pattern of the time-variant change of the logarithmic fundamental frequency is expressed as the sum of these two components.
The logarithmic fundamental frequency F0(t) (t: time) is formulated as shown by Expression (1).

\ln F_0(t) = \ln F_{\min}
  + \sum_{i=1}^{I} A_{pi}\, G_{pi}(t - T_{0i})
  + \sum_{j=1}^{J} A_{aj} \left\{ G_{aj}(t - T_{1j}) - G_{aj}(t - T_{2j}) \right\}
  \qquad (1)
In Expression (1), Fmin is the lowest frequency (hereinafter, referred to as a base pitch), I is the number of phrase commands in the sentence, Api is the magnitude of the i-th phrase command in the sentence, T0i is the start time of the i-th phrase command in the sentence, J is the number of accent commands in the sentence, Aaj is the magnitude of the j-th accent command in the sentence, and T1j and T2j are the start time and end time of the j-th accent command in the sentence, respectively.
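As an illustration of Expression (1), the following Python sketch evaluates ln F0(t) from a set of phrase and accent commands. The response functions Gp and Ga below are the critically damped second-order impulse and step responses conventionally used with this pitch control mechanism model (the Fujisaki model), and the constants alpha, beta and theta are typical literature values, not values taken from the patent.

```python
# Hedged sketch of Expression (1): ln F0(t) as base pitch + phrase + accent components.
import math


def Gp(t, alpha=3.0):
    """Phrase control: impulse response of a critically damped 2nd-order system."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def Ga(t, beta=20.0, theta=0.9):
    """Accent control: step response of a critically damped 2nd-order system,
    clipped at a ceiling value theta."""
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), theta) if t >= 0.0 else 0.0


def log_f0(t, f_min, phrase_commands, accent_commands):
    """ln F0(t) per Expression (1).

    phrase_commands: list of (Ap, T0)     -- magnitude and onset time [s]
    accent_commands: list of (Aa, T1, T2) -- magnitude, start time and end time [s]
    """
    value = math.log(f_min)                       # base pitch term ln Fmin
    for Ap, T0 in phrase_commands:
        value += Ap * Gp(t - T0)                  # phrase component
    for Aa, T1, T2 in accent_commands:
        value += Aa * (Ga(t - T1) - Ga(t - T2))   # accent component
    return value


# Example: one phrase command at t = 0 s and one accent command from 0.2 s to 0.5 s.
f0 = math.exp(log_f0(0.3, f_min=80.0,
                     phrase_commands=[(0.5, 0.0)],
                     accent_commands=[(0.4, 0.2, 0.5)]))
print(round(f0, 1), "Hz")  # prints the fundamental frequency at t = 0.3 s
```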
