Method for prosody generation by unit selection from an imitation speech database

Data processing: speech signal processing – linguistics – language – synthesis

Reexamination Certificate


Details

U.S. Classes: C704S260000, C704S266000
Type: Reexamination Certificate
Status: active
Patent Number: 06829581

FIELD OF THE INVENTION
The present invention relates to a process for producing natural-sounding speech from text, and more particularly, to a method of prosody generation by unit selection from an imitation speech database.
BACKGROUND AND SUMMARY OF THE INVENTION
Text-to-speech (TTS) conversion systems have achieved consistent-quality prosody using rule-based prosody generation systems. For purposes of this application, rule-based systems are systems that rely on human analysis to extract explicit rules for generating the prosody of different cases. Corpus-based prosody generation methods, by contrast, automatically extract the required data from a given labeled database. Rule-based synthesizer systems have achieved a high level of intelligibility, although their unnatural prosody and synthetic voice quality prevent them from being widely used in communication systems. Natural prosody is one of the more important requirements for high-quality speech synthesis to which users can listen comfortably. In addition, the ability to personalize the prosody of a synthetic voice to that of a particular speaker is useful for many applications.
Recently, corpus-based prosody modeling and generation methods have been shown to produce natural-sounding prosody for text-to-speech systems. Rule-based prosody generation systems, on the other hand, have the advantage of producing consistent-quality prosody. Compared with corpus-based methods, the rule-based method provides a conveniently explicit way of handling various prosodic effects that are not yet well handled by corpus-based modeling and generation methods.
The present invention provides a method that combines the robustness of the rule-based method of text-to-speech generation with a more natural and speaker-adaptive corpus-based method. The rule-based method produces a set of intonation events by selecting the syllables on which there should be a pitch peak, a dip, or a combination of the two, and produces the parameters that would ordinarily be used to generate the final shape of each event. The synthetic shape generated by the rule-based method is then used to select the best-matching units from an imitation speech database of a speaker's prosody, which are then concatenated to produce the final prosody.
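To make the matching step concrete, the following is a minimal sketch of how a rule-generated target event might be compared against database events. The patent does not specify the event parameters or the distance function, so the fields (event kind, F0 excursion, peak position, duration) and the cost weights below are purely illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical intonation-event representation; the actual parameter set
# used by the method is not specified in this description.
@dataclass
class IntonationEvent:
    kind: str          # "peak", "dip", or "peak+dip"
    height: float      # F0 excursion, e.g. in semitones
    position: float    # peak/dip position within the syllable (0..1)
    duration: float    # event duration in seconds

def target_cost(target: IntonationEvent, candidate: IntonationEvent) -> float:
    """Distance between a rule-generated target event and a database event.

    Mismatched event kinds receive a large penalty; otherwise the cost is a
    weighted sum of shape-parameter differences (weights are illustrative).
    """
    if target.kind != candidate.kind:
        return 1e6
    return (2.0 * abs(target.height - candidate.height)
            + 1.0 * abs(target.position - candidate.position)
            + 0.5 * abs(target.duration - candidate.duration))
```

A database event identical to the target scores zero; the best-matching unit for each target is the candidate with the lowest cost.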
The database of the speaker's prosody is created by having the target speaker listen to a set of speech-synthesized sentences and then imitate their prosody while still trying to sound natural. The imitation speech is time-aligned with the synthetic speech, and the time alignment is used to project the intonation events onto the imitation speech, thus avoiding the labor-intensive process of labeling the imitation speech database. After this processing, a database is formed of prosody events and their parameters. By using imitation speech, it is possible to reduce unwanted inconsistency and variability in the speaker's prosody, which could otherwise degrade the generated prosody. For prosody generation, a dynamic programming method is used to select a sequence of prosody events from the database that is both close to the target event sequence and connects smoothly and naturally. The selected events are smoothly concatenated, and their intonation and duration are copied onto the syllables and phonemes comprising the new sentence. The method can be used to easily and quickly personalize the prosody generation to that of a target speaker.
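The dynamic programming selection described above follows the general shape of Viterbi-style unit selection: each target position has a set of candidate database events, each candidate incurs a target cost (closeness to the target event) and a join cost (smoothness of the connection to the previous candidate), and the lowest total-cost path is chosen. The sketch below is a generic formulation of that idea, not the patent's actual cost functions or data structures.

```python
def select_events(targets, candidates_per_target, target_cost, join_cost):
    """Viterbi-style dynamic programming over candidate prosody events.

    targets: list of target events; candidates_per_target: one candidate
    list per target. Returns the candidate sequence minimizing the sum of
    per-event target costs plus pairwise join costs.
    """
    n = len(targets)
    # best[i][j] = (cost of cheapest path ending at candidate j of target i,
    #               backpointer to the chosen candidate of target i-1)
    best = [[(target_cost(targets[0], c), -1) for c in candidates_per_target[0]]]
    for i in range(1, n):
        row = []
        for cand in candidates_per_target[i]:
            tc = target_cost(targets[i], cand)
            prev = [best[i - 1][k][0] + join_cost(candidates_per_target[i - 1][k], cand)
                    for k in range(len(candidates_per_target[i - 1]))]
            k_best = min(range(len(prev)), key=prev.__getitem__)
            row.append((tc + prev[k_best], k_best))
        best.append(row)
    # Backtrack from the cheapest final state to recover the event sequence.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates_per_target[i][j])
        j = best[i][j][1]
    return path[::-1]
```

For example, with scalar "events", an absolute-difference target cost, and a small join penalty, the selection trades off matching each target against keeping adjacent units close to one another.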
Further areas of applicability of the present invention will become apparent from the detailed description provided hereinafter. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.


REFERENCES:
patent: 6101470 (2000-08-01), Eide et al.
patent: 6266637 (2001-07-01), Donovan et al.
patent: 6665641 (2003-12-01), Coorman et al.
patent: 6684187 (2004-01-01), Conkie
patent: 6697780 (2004-02-01), Beutnagel et al.
patent: 6701295 (2004-03-01), Beutnagel et al.
“Generating F0 Contours from ToBI Labels Using Linear Regression”, A. Black and A. Hunt; ATR Interpreting Telecommunications Research Laboratories.
“Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database”, A. Hunt and A. Black; ATR Interpreting Telecommunications Research Labs (1996) IEEE, pp. 373-376.
“Using Decision Trees With the Tilt Intonation Model to Predict F0 Contours”, K. Dusterhoff, A. Black, and P. Taylor; Centre for Speech Technology Research.
“Speech Synthesis by Phonological Structure Matching”, Paul Taylor and Alan W. Black; Centre for Speech Technology Research.
“Recent Improvements on Microsoft's Trainable Text-to-Speech System - Whistler”, X. Huang, A. Acero, H. Hon, Y. Ju, J. Liu, S. Meredith, M. Plumpe; Microsoft Research (1997) IEEE, pp. 959-962.
“Three Methods of Intonation Modeling”, A. Syrdal, G. Mohler, K. Dusterhoff, A. Conkie, A. Black; AT&T Labs.
