Speech synthesis apparatus having prosody generator with...

Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis

Reexamination Certificate

Details

C704S260000

active

06470316

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech synthesis apparatus that synthesizes given speech based on rules, and in particular to a speech synthesis apparatus that improves control of the duration of a phoneme when a vowel is devoiced, using a text-to-speech conversion technique that outputs as speech a mixed sentence including Chinese characters (called Kanji) and the Japanese syllabary (Kana) used in daily reading and writing.
2. Description of the Related Art
According to the text-to-speech conversion technique, Kanji and Kana characters used in our daily reading and writing are input and are then converted into speech to be output. Using this technique, there is no limitation on vocabulary to be output. Thus, the text-to-speech conversion technique is expected to be applied to various technical fields as an alternative technique to recording-reproducing speech synthesis.
When Kanji and Kana characters used in our daily reading and writing are input to a conventional speech synthesis apparatus, a text analysis module included therein generates a string of phonetic and prosodic symbols (hereinafter, referred to as an intermediate language) from the character information. The intermediate language describes how to read the input sentence, accents, intonation and the like as a character string. A prosody generation module then determines synthesizing parameters from the intermediate language generated by the text analysis module. The synthesizing parameters include the pattern of phoneme, the duration of the phoneme and the fundamental frequency (pitch of voice, hereinafter simply referred to as pitch) and the like. The synthesizing parameters determined are output to a speech generation module. The speech generation module generates a synthesized waveform by referring to the various synthesizing parameters generated in the prosody generation module and a voice segment dictionary in which phonemes are stored, and then outputs synthesized sound through a speaker.
Next, a conventional process conducted by the prosody generation module is described in detail. The conventional prosody generation module includes an intermediate language analysis module, a pitch contour generation module, a devoicing determination module, a phoneme power determination module, a phoneme duration calculation module and a duration modification module.
The intermediate language input to the prosody generation module is a string of phonetic characters in which the position of an accent, the position of a pause and the like are indicated. From this string, the parameters required for generating a waveform (hereinafter referred to as waveform-generating parameters) are determined, such as the time-variant change of the pitch (hereinafter referred to as a pitch pattern), the duration of each phoneme (hereinafter referred to as a phoneme duration), and the power of speech. The intermediate language input is subjected to analysis of the character string in the intermediate language analysis module. In the analysis, a word boundary is determined based on a symbol indicating a word's end in the intermediate language, and the mora position of an accent nucleus is obtained based on an accent symbol.
The accent nucleus is the position at which the accent falls. A word having an accent nucleus at the first mora is referred to as a word of accent type one, while a word having an accent nucleus at the n-th mora is referred to as a word of accent type n. Such words are referred to as accented words. On the other hand, a word having no accent nucleus (for example, “shin-bun” and “pasokon”, which mean a newspaper and a personal computer in Japanese, respectively) is referred to as a word of accent type zero or an unaccented word.
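The accent-type naming described above can be illustrated with a minimal sketch. The function names and the convention of passing the nucleus as a 1-based mora index (or None when there is no nucleus) are assumptions made for illustration; the source does not specify how the intermediate language encodes them.

```python
from typing import Optional


def accent_type(mora_count: int, nucleus: Optional[int]) -> int:
    """Return the accent type of a word.

    mora_count: number of morae in the word (hypothetical input).
    nucleus: 1-based mora position of the accent nucleus, or None if
             the word has no accent nucleus (hypothetical encoding).
    """
    if nucleus is None:
        return 0  # accent type zero: unaccented word
    if not 1 <= nucleus <= mora_count:
        raise ValueError("accent nucleus must lie within the word")
    return nucleus  # accent type n: nucleus at the n-th mora


def is_accented(mora_count: int, nucleus: Optional[int]) -> bool:
    """A word is accented when its accent type is not zero."""
    return accent_type(mora_count, nucleus) != 0
```

For example, a four-mora word with no nucleus, such as “shin-bun”, would be passed as `accent_type(4, None)` and classified as accent type zero.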
The pitch contour generation module determines a parameter for each response function based on a phrase symbol, the accent symbol and the like described in the intermediate language. In addition, if the intonation (the magnitude of the intonation) or an entire voice pitch is set by a user, the pitch contour generation module modifies the magnitude of a phrase command and/or that of an accent command in accordance with the user's setting.
The devoicing determination module determines whether or not a vowel is to be devoiced based on a phonetic symbol and the accent symbol in the intermediate language. The devoicing determination module then sends the determination result to the phoneme power determination module and the phoneme duration calculation module. Devoicing the vowel will be described in detail later.
The phoneme duration calculation module calculates the duration of each phoneme from the phonetic character string and sends the calculation result to the duration modification module. The phoneme duration is calculated by using rules or a statistical analysis such as Quantification theory (type one), depending on the type of the adjacent phoneme. In a case where the user sets a speech rate, the duration modification module linearly stretches or shrinks the phoneme duration depending on the set speech rate. Note, however, that such stretching or shrinking is normally performed only for vowels.
The phoneme duration stretched or shrunk depending on the speech rate by the duration modification module is sent to the speech generation module.
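The linear stretching and shrinking performed by the duration modification module might be sketched as follows. This is only an illustration under stated assumptions: the function name, the (symbol, duration in milliseconds) representation, and the convention that a rate above 1 means faster speech are all hypothetical and not taken from the source.

```python
# Vowel symbols assumed for this sketch (romanized Japanese vowels).
VOWELS = {"a", "i", "u", "e", "o"}


def modify_durations(phonemes, rate):
    """Linearly scale vowel durations by a speech-rate factor.

    phonemes: list of (symbol, duration_ms) pairs (hypothetical format).
    rate: speech-rate factor; values above 1 shorten vowels, values
          below 1 lengthen them (assumed convention).
    Consonant durations are left unchanged, reflecting the remark that
    stretching or shrinking is normally performed only for vowels.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return [
        (symbol, duration / rate if symbol in VOWELS else duration)
        for symbol, duration in phonemes
    ]
```

With this convention, doubling the rate halves each vowel duration while every consonant keeps its calculated duration.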
The phoneme power determination module calculates the amplitude value of the waveform and sends the calculated value to the speech generation module. The phoneme power is a power transition over three periods: a period corresponding to a rising portion of the phoneme, in which the amplitude gradually increases; a period corresponding to a steady state; and a period corresponding to a falling portion of the phoneme, in which the amplitude gradually decreases. The phoneme power is calculated from coefficient values in the form of a table.
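The three-period power transition can be pictured with a small sketch. It assumes linear ramps for the rising and falling portions and a hypothetical sampling step; the source states only that the values come from a table of coefficients, so the ramp shape and all names here are illustrative assumptions.

```python
def power_envelope(rise_ms, steady_ms, fall_ms, peak, step_ms=5):
    """Sample a rise / steady / fall amplitude envelope.

    rise_ms, steady_ms, fall_ms: lengths of the three periods in
        integer milliseconds (hypothetical parameters).
    peak: amplitude during the steady state.
    step_ms: sampling interval (assumption for this sketch).
    Linear ramps are assumed; a table-driven shape would replace them.
    """
    total = rise_ms + steady_ms + fall_ms
    envelope = []
    for t in range(0, total, step_ms):
        if t < rise_ms:
            # Rising portion: amplitude grows from 0 toward the peak.
            amplitude = peak * t / rise_ms
        elif t < rise_ms + steady_ms:
            # Steady state: amplitude stays at the peak.
            amplitude = peak
        else:
            # Falling portion: amplitude decays from the peak toward 0.
            amplitude = peak * (total - t) / fall_ms
        envelope.append(amplitude)
    return envelope
```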
The waveform generating parameters described above are sent to the speech generation module which generates the synthesized waveform.
Next, devoicing the vowel is described in detail.
When a person utters a word, air pushed out of the lungs is used as a sound source by creating an opening and closing movement of the vocal cords. The resonance characteristics of the vocal tract are changed by moving the jaw, the tongue and the lips in order to represent various phonemes. The pitch corresponds to the period of vibration of the vocal cords, and a change of the pitch expresses the accents and the intonation. In addition to sounds generated by the vibration of the vocal cords, there are other types of sounds. A fricative, that is, a noise-like sound, is generated by turbulence caused when air passes through a narrow space formed by a portion of the vocal tract and the tongue. Moreover, a plosive is generated by blocking the vocal tract with the tongue or the lips to temporarily stop the airflow and then releasing the airflow so as to generate an impulse-like sound.
The phonemes accompanied by the vibration of the vocal cords, namely the vowels, plosives “/b, d, g/”, fricatives “/j, z/”, and nasal consonants and liquids such as “/m, n, r/”, are referred to as voiced sounds, while the phonemes accompanied by no vibration of the vocal cords, for example plosives “/p, t, k/” and fricatives “/s, h, f/”, are referred to as voiceless sounds. In particular, consonants are classified into voiced consonants, accompanied by the vibration of the vocal cords, and voiceless consonants, without the vibration of the vocal cords. In the case of a voiced sound, a periodic waveform is generated by the vibration of the vocal cords. On the other hand, a noise-like waveform is generated in the case of a voiceless sound.
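The voiced and voiceless classes listed above can be written down as a simple lookup. The set contents follow the phonemes named in the text; the function name and the single-character romanized symbols are assumptions for illustration, and the sets are not an exhaustive phoneme inventory.

```python
# Voiced sounds named in the text: vowels, plosives /b, d, g/,
# fricatives /j, z/, and nasals and liquids such as /m, n, r/.
VOICED = set("aiueo") | {"b", "d", "g", "j", "z", "m", "n", "r"}

# Voiceless sounds named in the text: plosives /p, t, k/ and
# fricatives /s, h, f/.
VOICELESS = {"p", "t", "k", "s", "h", "f"}


def is_voiced(phoneme: str) -> bool:
    """Return True for a voiced sound, False for a voiceless one.

    Raises KeyError for symbols outside the two illustrative sets.
    """
    if phoneme in VOICED:
        return True
    if phoneme in VOICELESS:
        return False
    raise KeyError(f"unknown phoneme: {phoneme!r}")
```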
In common language, when the word “kiku” (the Japanese word meaning chrysanthemum) is uttered naturally, for example, the first vowel “i” in the word “kiku” is uttered using only breath, without vibrating the vocal cords. Such a vowel is referred to as a devoiced vowel.
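The “kiku” example can be reproduced with a rough sketch. The rule used here, that a high vowel /i/ or /u/ between two voiceless consonants is devoiced, is a commonly cited simplification of Japanese vowel devoicing and is an assumption of this sketch; it is not the determination rule of the apparatus, which also takes the accent symbol into account. All names are hypothetical.

```python
# Voiceless consonants as listed in the text above.
VOICELESS = {"p", "t", "k", "s", "h", "f"}

# High vowels, assumed to be the devoicing candidates in this sketch.
HIGH_VOWELS = {"i", "u"}


def devoiced_positions(phonemes):
    """Return indices of vowels devoiced under the simplified rule.

    phonemes: list of single-character phoneme symbols (hypothetical
        format). A high vowel is marked as devoiced only when both of
        its neighbours are voiceless consonants; word-final vowels and
        accent effects are deliberately ignored here.
    """
    return [
        i
        for i in range(1, len(phonemes) - 1)
        if phonemes[i] in HIGH_VOWELS
        and phonemes[i - 1] in VOICELESS
        and phonemes[i + 1] in VOICELESS
    ]
```

Calling `devoiced_positions(list("kiku"))` marks only the first vowel “i”, which matches the example given in the text.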
In the text-to-speech conversion system, it is necessary to express a vowel by devoicing it in order to improve the qual
