Reexamination Certificate
1999-09-22
2002-08-20
Dorvil, Richemond (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
C704S268000
Reexamination Certificate
active
06438522
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of Technology
The present invention relates to a speech synthesis method and apparatus, and in particular to a speech synthesis method and apparatus whereby words, phrases or short sentences can be generated as natural-sounding synthesized speech having accurate rhythm and intonation characteristics, for such applications as vehicle navigation systems, personal computers, etc.
2. Prior Art
In generating synthesized speech from input data representing a speech item such as a word, phrase or sentence, the essential requirements for obtaining natural-sounding synthesized speech are that the rhythm and intonation be as close as possible to those of that speech item when spoken by a person. The rhythm of an enunciated speech item, and the average speed of enunciating its syllables, are defined by the respective durations of the sequence of morae of that speech item. Although the term “morae” is generally applied only to the Japanese language, the term will be used herein with a more general meaning, as signifying “rhythm intervals”, i.e., the durations for which the respective syllables of a speech item are enunciated.
The classification of respective sounds as “syllables” depends upon the particular language in which speech synthesis is being performed. For example, English does not have a syllable that is directly equivalent to the Japanese syllable “N” (the syllabic nasal), which is considered to occupy one mora in spoken Japanese. Furthermore, the term “accent” or “accented syllable” as used herein is to be understood as signifying, in the case of Japanese, a syllable which exhibits an abrupt drop in pitch. In the case of English, however, the term “accented” is to be understood as applying to a syllable or word which is stressed, i.e., for which there is an abrupt increase in speech power. Thus, although the speech item examples used in the following description of embodiments of the invention are generally in Japanese, the invention is not limited in its application to that language.
One prior art system which is concerned with the problem of determining the rhythm of synthesized speech is described in Japanese patent HEI 6-274195 (Japanese Language Speech Synthesis System forming Normalized Vowel Lengths and Consonant Lengths Between Vowel Center-of-Gravity Points). With that prior art system, as shown in FIG. 21, a rule-based method is utilized, whereby the time interval between the vowel energy center-of-gravity points of the respective vowels of two mutually adjacent morae, formed of a leading syllable 11 and a trailing syllable 12, is taken as being the mora interval between these syllables, and the value of that mora interval is determined by using the consonant which is located between the two morae and the pronunciation speed as parameters. The respective durations of each of the vowels of the two morae are then inferred, by using as parameters the vowel energy center-of-gravity interval and the consonant durations.
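By way of illustration only, the Python sketch below shows the general shape of such a rule-based scheme; the function names, consonant classes, numeric values and the simple splitting rule are invented for the example and are not taken from the patent. The mora interval is obtained from the intervening consonant and the pronunciation speed, and the two vowel durations are then inferred from that interval together with the consonant durations.

```python
# Hedged sketch of a rule-based mora timing scheme; all names and values
# are invented for illustration, not the patented implementation.

# Illustrative base intervals (seconds) per intervening consonant class.
BASE_INTERVAL = {"k": 0.16, "t": 0.15, "s": 0.18, "m": 0.17, None: 0.14}

def mora_interval(consonant, speed_factor):
    """Interval between the vowel energy center-of-gravity points of two
    adjacent morae, scaled by pronunciation speed (1.0 = normal)."""
    return BASE_INTERVAL.get(consonant, 0.16) / speed_factor

def infer_vowel_durations(interval, leading_cons_dur, trailing_cons_dur):
    """Infer the two vowel durations from the center-of-gravity interval
    and the consonant durations (deliberately simplified: the remainder
    is split evenly between the two vowels)."""
    remaining = max(interval - leading_cons_dur - trailing_cons_dur, 0.0)
    return remaining / 2.0, remaining / 2.0

interval = mora_interval("k", speed_factor=1.2)
v1, v2 = infer_vowel_durations(interval, 0.04, 0.05)
print(f"interval={interval:.3f}s, vowel durations={v1:.3f}s / {v2:.3f}s")
```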
Another example of prior art systems for synthesized speech is described in Japanese patent HEI 7-261778 (Method and Apparatus for Speech Information Processing), whereby respective pitch patterns can be generated for words which are to be speech-synthesized. Such a pitch pattern defines, for each phoneme of a word, the phoneme duration and the form of variation of pitch in that phoneme. With the first embodiment of that invention, a pitch pattern is generated for a word by a process of:
(a) predetermining the respective durations of the phonemes of the word,
(b) determining the number of morae and the position of any accented syllable (i.e., the accent type) of the word,
(c) predetermining certain characteristic amounts, i.e., values such as reference values of pitch and speech power, for the word,
(d) for each vowel of the word, looking up a pitch pattern table to obtain respective values for pitch at each of a plurality of successive time points within the vowel (these pitch values for a vowel being obtained from the pitch pattern table in accordance with the number of morae of the word, the mora position of that vowel and the position of any accented syllable in the word), and
(e) within each vowel of the word, deriving interpolated values of pitch by using the set of pitch values obtained for that vowel from the pitch pattern table.
Interpolation from the vowel pitch values can also be applied to obtain the pitch values of any consonants in the word.
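A minimal Python sketch of steps (d) and (e), assuming an invented pitch pattern table keyed by (number of morae, mora position, accent type) and holding a few pitch samples per vowel, might look as follows; pitch within each vowel is filled in by linear interpolation between the table samples.

```python
import numpy as np

# Invented pitch pattern table for illustration only:
# (morae in word, mora position of the vowel, accent type) -> pitch samples (Hz).
PITCH_PATTERN_TABLE = {
    (2, 1, 0): [120.0, 135.0, 128.0],
    (2, 2, 0): [125.0, 118.0, 105.0],
}

def vowel_pitch_contour(morae_count, mora_position, accent_type,
                        duration_s, frame_s=0.01):
    """Step (d): look up the pitch samples for this vowel.
    Step (e): interpolate them over the vowel duration, frame by frame."""
    samples = PITCH_PATTERN_TABLE[(morae_count, mora_position, accent_type)]
    sample_times = np.linspace(0.0, duration_s, num=len(samples))
    frame_times = np.arange(0.0, duration_s, frame_s)
    return np.interp(frame_times, sample_times, samples)

# Example: pitch contour of the first vowel of a two-mora, accent-type-0 word.
print(vowel_pitch_contour(2, 1, 0, duration_s=0.12).round(1))
```

The same interpolation can be extended across consonants by treating the neighbouring vowel samples as anchors, as noted above.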
As shown in FIG. 22, that system includes a speech file 21 having stored therein a speech database of words which are expressed in a form whereby the morae number and accent type can be determined, with each word being assigned a file number. A word which is to be speech-synthesized is first supplied to a features extraction section 22, a label attachment section 23 and a phoneme list generating section 14. The label attachment section 23 determines the starting and ending time points for audibly generating each of the phonemes constituting the word; this operation is executed manually, or under the control of a program. The phoneme list generating section 14 determines the morae number and accent type of the word, and the information thus obtained by the label attachment section 23 and the phoneme list generating section 14, labelled with the file number of the word, is combined to form entries for the respective phonemes of the word in a table that is held in a label file 16.
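As a hypothetical illustration only (the field names and values below are invented, not taken from the patent), each entry of such a label file might combine the timing information from the label attachment section with the morae number and accent type from the phoneme list, keyed by the file number of the word:

```python
from dataclasses import dataclass

@dataclass
class LabelEntry:
    file_number: int   # identifies the word in the speech database
    phoneme: str
    start_s: float     # start time of audible generation of the phoneme
    end_s: float       # end time of audible generation of the phoneme
    morae_count: int   # number of morae in the word
    accent_type: int   # position of the accented syllable (0 = unaccented)

# Illustrative entries for a two-mora word.
label_file = [
    LabelEntry(101, "a", 0.00, 0.10, 2, 0),
    LabelEntry(101, "z", 0.10, 0.16, 2, 0),
    LabelEntry(101, "i", 0.16, 0.27, 2, 0),
]
```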
A characteristic amounts file 25 specifies such characteristic quantities as center values of fundamental frequency and speech power which are to be used for the selected word. The data which have been set into the characteristic amounts file 25 and the label file 16 for the selected word are supplied to a statistical processing section 27, which contains the aforementioned pitch pattern table. The aforementioned respective sets of frequency values for each vowel of the word are thereby obtained from the pitch pattern table, in accordance with the environmental conditions (number of morae in the word, mora position of that vowel, accent type of the word) affecting that vowel, and are supplied to a pitch pattern generating section 28. The pitch pattern generating section 28 executes the aforementioned interpolative processing to obtain the requisite pitch pattern for the word.
FIG. 23 graphically illustrates a pitch pattern which might be derived by the system of FIG. 22, for the case of a word “azi”. The respective durations which have been determined for the three phonemes of this word are indicated as L1, L2, L3, and it is assumed that three pitch values are obtained by the statistical processing section 27 for each vowel, these being indicated as f1, f2, f3 for the leading vowel “a”, with all other pitch values being derived by interpolation.
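The sketch below reproduces the gist of this FIG. 23 example with invented numbers (the durations, pitch anchors and frame rate are assumptions, not figures from the patent): three pitch anchors f1-f3 for the leading vowel “a”, an assumed anchor for the vowel “i”, and linear interpolation for every other pitch value, including across the consonant “z”.

```python
import numpy as np

# Invented values standing in for the FIG. 23 example of the word "azi".
L1, L2, L3 = 0.10, 0.06, 0.12        # durations of "a", "z", "i" (seconds)
f1, f2, f3 = 130.0, 140.0, 132.0     # table pitch values for the vowel "a" (Hz)
fi = 110.0                           # assumed anchor for the vowel "i" (Hz)

# Anchor times: f1-f3 spread over "a"; one value at the centre of "i".
anchor_times = [0.0, L1 / 2, L1, L1 + L2 + L3 / 2]
anchor_pitch = [f1, f2, f3, fi]

# All other pitch values (including over the consonant "z") by interpolation.
frame_times = np.arange(0.0, L1 + L2 + L3, 0.01)
pitch_pattern = np.interp(frame_times, anchor_times, anchor_pitch)
print(pitch_pattern.round(1))
```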
It will be apparent that it is necessary to derive the sets of values to be utilized in the pitch pattern table of the statistical processing section 27 by statistical analysis of large amounts of speech patterns, and the need to process such large amounts of data in order to obtain sufficient accuracy of results is a disadvantage of this method. Furthermore, although the resultant information will specify average forms of pitch variation, such an average form of pitch variation may not necessarily correspond to the actual intonation of a specific word in natural speech.
With the prior art method of FIG. 21, on the other hand, the rhythm of the resultant synthesized speech, i.e., the rhythm within a word or sentence, is determined only on the basis of assumed timing relationships between each of respective pairs of adjacent morae, irrespective of the actual rhythm which the word or sentence would have in natural speech. Hence it will be impossible to generate synthesized speech having a rhythm which is close to that of natural speech.
There is therefore a requirement for a speech synthesis system whereby the resultant synthesized speech is substantially close to natural speech in its rhythm and intonation characteristics, but which…
Minowa Toshimitsu
Mochizuki Ryo
Nishimura Hirofumi
Armstrong Angela
Dorvil Richemond
Lowe Hauptman & Gilman & Berner LLP
Matsushita Electric Industrial Co., Ltd.