Reexamination Certificate
2001-01-03
2003-09-23
Šmits, Tālivaldis Ivars (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
active
06625575
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to text-to-speech conversion technology, more particularly to a method of intonation control in synthesized speech.
Text-to-speech conversion is a technology that converts ordinary text, of the type that people read every day, to spoken words, and outputs a speech signal. Because of its unlimited output vocabulary, this technology has potential uses in many fields, as a replacement for pre-recorded speech synthesis.
A typical speech synthesis system of the text-to-speech type has the structure shown in FIG. 1. The input is a machine-readable form of ordinary text. A text analyzer 101 analyzes the input text and generates a sequence of phonetic and prosodic symbols that use predefined character strings (referred to below as an intermediate language) to indicate pronunciation, accent, intonation, and other information. Incidentally, the illustrated system processes Japanese text, and the accent referred to herein is a pitch accent.
To generate the intermediate-language representation, the text analyzer 101 carries out linguistic processing such as morphemic analysis and semantic analysis, referring to a word dictionary 104 that gives the pronunciation, accent, and other information about each word. The resulting intermediate-language representation is processed by a parameter generator 102 to determine various synthesis parameters. These parameters include patterns of speech elements (sound types), phonation times (sound durations), phonation power (intensity of sound), fundamental frequency (voice pitch), and the like. The synthesis parameters are sent to a waveform generator 103, which generates synthesized speech waveforms by referring to a speech-element dictionary 105. The speech-element dictionary 105 is, for example, a read-only memory (ROM) storing speech elements and other information. The stored speech elements are the basic units of speech from which waveforms are synthesized. There are many types of speech elements, corresponding, for example, to different sounds. The synthesized waveforms are reproduced through a loudspeaker and heard as synthesized speech.
The internal structure of the parameter generator 102 is shown in FIG. 2. The input intermediate language representation comprises phonetic character sequences accompanied by prosodic information such as accent position, positions of pauses, and so on. The parameters determined from this information include the time variations in pitch (referred to below as the pitch pattern), phonation power, the phonation time of each phoneme, the addresses of speech elements stored in the speech-element dictionary, and other parameters (referred to below as synthesis parameters) needed for synthesizing speech waveforms.
In the parameter generator 102, an intermediate language analyzer (ILA) 201 analyzes the input intermediate language, identifies word boundaries from word-delimiting symbols and breath-group symbols, and analyzes the accent symbols to find the moraic position of the accent nucleus of each word. A breath group is a unit of text that is spoken in one breath. A mora, in Japanese, is a short syllable or part of a long syllable. A voiced mora includes one vowel phoneme or the nasal /N/ phoneme. The accent nucleus, in Japanese, is the position where the pitch drops sharply. A word with an accent nucleus in the first mora is said to have a type-one accent. A word with an accent nucleus in the n-th mora is said to have a type-n accent (n being an integer greater than one), and these words are said to have a rising-and-falling accent. Words with no accent nucleus are said to have a type-zero accent or a flat accent; examples include the Japanese words ‘shimbun’ (newspaper) and ‘pasokon’ (personal computer).
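The accent-type naming just described can be illustrated with a minimal sketch. The function names and the convention of passing None for a word without an accent nucleus are assumptions made for this illustration; they are not taken from the intermediate language itself.

```python
def accent_type(nucleus_mora):
    """Return the accent type number of a word.

    nucleus_mora is the 1-based moraic position of the accent nucleus,
    or None when the word has no accent nucleus (hypothetical convention).
    """
    if nucleus_mora is None:
        return 0  # type-zero (flat) accent
    if nucleus_mora < 1:
        raise ValueError("moraic positions are counted from 1")
    return nucleus_mora  # nucleus in the n-th mora gives a type-n accent


def accent_class(nucleus_mora):
    """Name the accent class that corresponds to the accent type."""
    n = accent_type(nucleus_mora)
    if n == 0:
        return "flat"
    if n == 1:
        return "type-one"
    return "rising-and-falling"
```

For example, a word such as 'shimbun', which has no accent nucleus, would be passed as None and classified as flat.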
A pitch pattern generator 202 calculates the pitch frequency of each voiced mora from the prosodic information in the intermediate language. In conventional Japanese text-to-speech conversion, pitch patterns are controlled by estimating the pitch frequency at the center of the vowel (or nasal /N/) in the mora, and using linear interpolation or spline interpolation between these positions; this technique is referred to as point-pitch modeling. Central vowel pitches are estimated by well-known statistical techniques such as Chikio Hayashi's first quantification method. Control factors include, for example, the accent type of the word to which the vowel belongs, the position of the mora relative to the start of the word, the position of the mora within the breath group, and the phonemic type of the mora. The collection of estimated vowel-centered pitches will be referred to below as the point pitch pattern, while the entire pattern generated by interpolation will be referred to simply as the pitch pattern. The pitch pattern is calculated on the basis of the phonation time of each phoneme as determined by a phonation time generator 203, described below. If the user has specified a desired intonation level or a desired voice pitch, corresponding processing is carried out. Voice pitch is typically specifiable on about five to ten levels, for each of which a predetermined constant is added to the calculated pitch values. Intonation is typically specifiable on three to five levels, for each of which the calculated pitch values are partly multiplied by a predetermined constant. These control features are provided to enable specific words in a sentence to be emphasized or de-emphasized. Further information will be given later, as these are the features with which the present invention is concerned.
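Point-pitch modeling with linear interpolation, followed by the additive voice-pitch adjustment, might look roughly like the sketch below. The function names, the representation of the point pitch pattern as (time in ms, pitch in Hz) pairs, and the 8 ms sampling step are assumptions made for illustration. The intonation multiplier is left out, because the text above does not say which part of the pitch values is multiplied.

```python
from bisect import bisect_right


def interpolate_pitch(points, t):
    """Linearly interpolate the pitch (Hz) at time t (ms).

    points is a list of (time_ms, pitch_hz) pairs sorted by time, one per
    vowel center (the point pitch pattern). Outside the covered range the
    nearest point pitch is held constant (an assumption of this sketch).
    """
    times = [p[0] for p in points]
    if t <= times[0]:
        return points[0][1]
    if t >= times[-1]:
        return points[-1][1]
    i = bisect_right(times, t)  # first point strictly later than t
    (t0, f0), (t1, f1) = points[i - 1], points[i]
    return f0 + (f1 - f0) * (t - t0) / (t1 - t0)


def pitch_pattern(points, frame_ms=8.0, pitch_offset_hz=0.0):
    """Sample the interpolated pitch pattern once per frame.

    pitch_offset_hz stands in for the predetermined constant added for the
    selected voice-pitch level (hypothetical value supplied by the caller).
    """
    n_frames = int(points[-1][0] // frame_ms) + 1
    return [interpolate_pitch(points, k * frame_ms) + pitch_offset_hz
            for k in range(n_frames)]
```

Spline interpolation, which the text also mentions, could replace interpolate_pitch without changing the rest of the sketch.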
The phonation time generator 203 determines the length of each phoneme from the phonetic character sequences and prosodic symbols. Common methods of determining the phonation time include statistical techniques such as the above-mentioned quantification method, using the preceding and following phoneme types, or moraic position within the word or breath group. If the user has specified a desired speech speed, the phonation times are expanded or contracted accordingly. Speech speed can typically be specified on about five to ten levels; the calculated phonation times are multiplied by a predetermined constant for each level. Specifically, the phonation times are lengthened to slow down the speech, and shortened to speed up the speech.
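The speech-speed adjustment amounts to multiplying every phonation time by a constant chosen per level. The following is a minimal sketch; the five levels and their factors are invented for illustration and are not values given in the text.

```python
# Hypothetical table: speed level -> multiplier applied to phonation times.
# Factors above 1 lengthen the phonemes (slower speech); factors below 1
# shorten them (faster speech).
SPEED_FACTORS = {1: 1.5, 2: 1.25, 3: 1.0, 4: 0.8, 5: 0.65}


def scale_phonation_times(durations_ms, speed_level):
    """Multiply each phonation time by the constant for the chosen level."""
    factor = SPEED_FACTORS[speed_level]
    return [d * factor for d in durations_ms]
```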
A phonation power generator 204 calculates the amplitude of the waveform of each phoneme from the phonetic character sequences. The waveform amplitude values are determined empirically from factors such as the phoneme type (/a, e, i, o, u/, for example) and moraic position in the breath group. The phonation power generator 204 also determines the power transitions within each mora: the initial interval in which the amplitude value gradually increases, the steady-state interval that follows, and the final interval in which the amplitude value gradually decreases. Tables of numerical values are usually used to carry out this power control. If the user has specified a desired voice volume level, the amplitude values are increased or decreased accordingly. Voice volume can typically be specified on about ten levels. The amplitude values are multiplied by a predetermined constant for each level.
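The three power intervals within a mora (rise, steady state, fall) and the volume multiplier can be sketched as follows. The text says tables of numerical values are usually used; the linear ramps here are a simplifying assumption, and the function and parameter names are hypothetical.

```python
def mora_envelope(n_frames, peak, rise_frames, fall_frames):
    """Return per-frame amplitudes for one mora.

    The amplitude ramps up linearly over rise_frames, holds at peak, and
    ramps down linearly over fall_frames (linear ramps are an assumption
    standing in for the empirically determined tables).
    """
    if rise_frames + fall_frames > n_frames:
        raise ValueError("rise and fall intervals exceed the mora length")
    envelope = []
    for k in range(n_frames):
        if k < rise_frames:
            gain = (k + 1) / rise_frames          # initial rising interval
        elif k >= n_frames - fall_frames:
            gain = (n_frames - k) / fall_frames   # final falling interval
        else:
            gain = 1.0                            # steady-state interval
        envelope.append(peak * gain)
    return envelope


def apply_volume(envelope, volume_factor):
    """Multiply amplitudes by the constant for the chosen volume level."""
    return [a * volume_factor for a in envelope]
```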
A speech element selector 205 determines the addresses in the speech-element dictionary 105 of the speech elements needed for expressing the phonetic character sequences. The speech elements stored in the speech-element dictionary 105 include elements derived from several types of voices, normally including at least one male voice and at least one female voice. The user specifies a desired voice type, and the speech element addresses are determined accordingly.
The pitch pattern, phonation powers, phonation times, and speech element addresses determined as described above are supplied to a synthesis parameter generator (SPG) 206, which generates the synthesis parameters. The synthesis parameters describe waveform frames with a typical length of about eight milliseconds (8 ms). The synthesis parameters are sent to the
Oki Electric Industry Co. Ltd.
Rabin & Berdo P.C.
Šmits Tālivaldis Ivars