Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis
Reexamination Certificate
1999-03-15
2001-02-06
Šmits, Tālivaldis I. (Department: 2741)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
C704S224000, C704S258000, C704S264000, C704S211000
active
06185533
ABSTRACT:
BACKGROUND AND SUMMARY OF THE INVENTION
The present invention relates generally to text-to-speech (TTS) systems and speech synthesis. More particularly, the invention relates to a system for generating duration templates which can be used in a text-to-speech system to provide more natural-sounding speech synthesis.
The task of generating natural, human-sounding prosody for text-to-speech and speech synthesis has historically been one of the most challenging problems that researchers and developers have had to face. Text-to-speech systems have in general become infamous for their unnatural prosody, such as “robotic” intonations or incorrect sentence rhythm and timing. To address this problem, some prior systems have used neural networks and vector clustering algorithms in an attempt to simulate natural-sounding prosody. Aside from being only marginally successful, these “black box” computational techniques give the developer no feedback regarding which parameters are crucial for natural-sounding prosody.
The present invention builds upon a different approach which was disclosed in a prior patent application entitled “Speech Synthesis Employing Prosody Templates”. In the disclosed approach, samples of actual human speech are used to develop prosody templates. The templates define a relationship between syllabic stress patterns and certain prosodic variables such as intonation (F0) and duration, especially focusing on F0 templates. Thus, unlike prior algorithmic approaches, the disclosed approach uses naturally occurring lexical and acoustic attributes (e.g., stress pattern, number of syllables, intonation, duration) that can be directly observed and understood by the researcher or developer.
The previously disclosed approach stores the prosody templates for intonation (F0) and duration information in a database that is accessed by specifying the number of syllables and stress pattern associated with a given word. A word dictionary is provided to supply the system with the requisite information concerning number of syllables and stress patterns. The text processor generates phonemic representations of input words, using the word dictionary to identify the stress pattern of the input words. A prosody module then accesses the database of templates, using the number of syllables and the stress pattern as the lookup key. A prosody template for the given word is then obtained from the database and used to supply prosody information to the sound generation module, which generates synthesized speech based on the phonemic representation and the prosody information.
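The lookup described above can be sketched as follows. This is only an illustrative sketch, not the implementation from the prior application: the names `WORD_DICTIONARY`, `TEMPLATE_DB`, `ProsodyTemplate`, and `lookup_template`, the stress encoding (1 = stressed, 0 = unstressed), and all numeric values are assumptions made for the example.

```python
from typing import Dict, List, NamedTuple, Tuple


class ProsodyTemplate(NamedTuple):
    """Hypothetical template: one value per syllable for each prosodic variable."""
    f0: List[float]        # assumed: normalized intonation contour per syllable
    duration: List[float]  # assumed: normalized duration per syllable


StressPattern = Tuple[int, ...]
TemplateKey = Tuple[int, StressPattern]  # (number of syllables, stress pattern)

# Hypothetical word dictionary: word -> stress level of each syllable.
WORD_DICTIONARY: Dict[str, StressPattern] = {
    "table": (1, 0),
    "above": (0, 1),
    "banana": (0, 1, 0),
}

# Hypothetical template database; the numbers are made up for illustration.
TEMPLATE_DB: Dict[TemplateKey, ProsodyTemplate] = {
    (2, (1, 0)): ProsodyTemplate(f0=[1.10, 0.90], duration=[1.20, 0.85]),
    (2, (0, 1)): ProsodyTemplate(f0=[0.95, 1.15], duration=[0.80, 1.30]),
    (3, (0, 1, 0)): ProsodyTemplate(f0=[0.90, 1.20, 0.85], duration=[0.80, 1.25, 0.90]),
}


def lookup_template(word: str) -> ProsodyTemplate:
    """Find the stress pattern in the word dictionary, then fetch the template
    keyed by (number of syllables, stress pattern)."""
    stress = WORD_DICTIONARY[word.lower()]
    return TEMPLATE_DB[(len(stress), stress)]


if __name__ == "__main__":
    print(lookup_template("banana"))
```

Because the key depends only on syllable count and stress pattern, every word with the same pattern shares one template, which is why the template set stays small at the word level.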
The previously disclosed approach focuses on speech at the word level. Words are subdivided into syllables and thus represent the basic unit of prosody. The stress pattern defined by the syllables determines the most perceptually important characteristics of both intonation (F0) and duration. At this level of granularity, the template set is quite small in size and easily implemented in text-to-speech and speech synthesis systems. While a word level prosodic analysis using syllables is presently preferred, the prosody template techniques of the invention can be used in systems exhibiting other levels of granularity. For example, the template set can be expanded to allow for more grouping features, both at the sentence and word level. In this regard, duration modification (e.g. lengthening) caused by phrase or sentence position and type, segmental structure in a syllable, and phonetic representation can be used as attributes with which to categorize certain prosodic patterns.
Although text-to-speech systems based upon prosody templates that are derived from samples of actual human speech have held out the promise of greatly improved speech synthesis, those systems have been limited by the difficulty of constructing suitable duration templates. To obtain temporal prosody patterns, the purely segmental timing quantities must be factored out from the larger scale prosodic effects. This has proven to be much more difficult than constructing F0 templates, wherein intonation information can be obtained by visually examining individual F0 data.
The present invention presents a method of separating high-level prosodic behavior from purely articulatory constraints so that high-level timing information can be extracted from human speech. The extracted timing information is used to construct duration templates that are employed for speech synthesis. Initially, the words of input text are segmented into phonemes and syllables, and the associated stress pattern is assigned. The stress-assigned words can then be assigned grouping features by a text grouping module. A phoneme cluster module groups the phonemes into phoneme pairs and single phonemes. A static duration associated with each phoneme pair and single phoneme is retrieved from a global static table. A normalization module generates a normalized duration value for a syllable based upon lengthening or shortening of the global static durations associated with the phonemes that comprise the syllable. The normalized duration value is stored in a duration template based upon the grouping features associated with that syllable.
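The processing steps just summarized can be illustrated with a minimal sketch. It rests on several assumptions that the summary does not state: the normalized value is taken to be the ratio of the observed syllable duration to the sum of the static durations, phonemes are clustered greedily from left to right, and the values collected for one set of grouping features are averaged. All names (`GLOBAL_STATIC_TABLE`, `cluster_phonemes`, `normalized_duration`, `build_duration_templates`) and all durations are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Hashable, List, Sequence, Tuple

Cluster = Tuple[str, ...]

# Hypothetical global static table: static duration in milliseconds for
# phoneme pairs and single phonemes. The values are made up for illustration.
GLOBAL_STATIC_TABLE: Dict[Cluster, float] = {
    ("s", "t"): 150.0,
    ("s",): 100.0,
    ("t",): 70.0,
    ("ey",): 140.0,
    ("b",): 60.0,
    ("ah",): 80.0,
    ("l",): 65.0,
}


def cluster_phonemes(phonemes: Sequence[str]) -> List[Cluster]:
    """Group phonemes into pairs and singles (assumed greedy, left to right):
    use a pair when the static table has an entry for it, else a single."""
    clusters: List[Cluster] = []
    i = 0
    while i < len(phonemes):
        pair = tuple(phonemes[i:i + 2])
        if len(pair) == 2 and pair in GLOBAL_STATIC_TABLE:
            clusters.append(pair)
            i += 2
        else:
            clusters.append((phonemes[i],))
            i += 1
    return clusters


def normalized_duration(phonemes: Sequence[str], observed_ms: float) -> float:
    """Assumed normalization: observed syllable duration divided by the summed
    static durations. Above 1.0 means lengthening, below 1.0 shortening."""
    static_total = sum(GLOBAL_STATIC_TABLE[c] for c in cluster_phonemes(phonemes))
    return observed_ms / static_total


def build_duration_templates(
    samples: Sequence[Tuple[Hashable, Sequence[str], float]],
) -> Dict[Hashable, float]:
    """Store normalized values under their grouping features and average them.
    Each sample is (grouping_features, syllable_phonemes, observed_ms)."""
    collected: Dict[Hashable, List[float]] = defaultdict(list)
    for features, phonemes, observed_ms in samples:
        collected[features].append(normalized_duration(phonemes, observed_ms))
    return {features: mean(values) for features, values in collected.items()}


if __name__ == "__main__":
    # Grouping features here are (stress pattern, syllable index), an assumption.
    key = ((1, 0), 0)
    samples = [(key, ["s", "t", "ey"], 348.0), (key, ["b", "ah", "l"], 205.0)]
    print(build_duration_templates(samples))
```

Dividing by the static total is one simple way to factor segmental timing out of the measurement, so that what remains reflects the larger scale prosodic effect attached to the grouping features.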
For a more complete understanding of the invention, its objectives and advantages, refer to the following specification and to the accompanying drawings.
REFERENCES:
patent: 5230037 (1993-07-01), Giustiniani et al.
patent: 5278943 (1994-01-01), Gasper et al.
patent: 5384893 (1995-01-01), Hutchins
patent: 5592585 (1997-01-01), Van Coile et al.
patent: 5636325 (1997-06-01), Farrett
patent: 5642520 (1997-06-01), Takeshita et al.
patent: 5652828 (1997-07-01), Silverman
patent: 5696879 (1997-12-01), Cline et al.
patent: 5704009 (1997-12-01), Cline et al.
patent: 5727120 (1998-03-01), Van Coile et al.
patent: 5729694 (1998-03-01), Holzrichter et al.
patent: 5732395 (1998-03-01), Silverman
patent: 5749071 (1998-05-01), Silverman
patent: 5751906 (1998-05-01), Silverman
patent: 5796916 (1998-08-01), Meredith
patent: 5828994 (1998-10-01), Covell et al.
patent: 6029131 (2000-02-01), Bruckert
Bailly, G., “Integration of Rhythmic and Syntactic Constraints in a Model of Generation of French Prosody,” Elsevier Science Publishers, Jun. 1989.
Campbell, W. N., “Syllable-based Segmental Duration,” pp. 211-224 (undated), Talking Machines: Theories, Models, and Designs, copyright 1992, Elsevier Science Publishers B.V.
Hata Kazue
Holm Frode
Šmits Tālivaldis I.
Harness, Dickey & Pierce, P.L.C.
Matsushita Electric Industrial Co., Ltd.
Nolan Daniel A.