Coarticulation method for audio-visual text-to-speech synthesis

Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C704S263000, C704S276000

Reexamination Certificate

active

06662161

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates to the field of photo-realistic imaging. More particularly, the invention relates to a method for generating talking heads in a text-to-speech synthesis application which provides for realistic-looking coarticulation effects.
Visual TTS, the integration of a “talking head” into a text-to-speech (“TTS”) synthesis system, can be used or a variety of applications. Such applications include, for example, model-based image compression for video telephony, presentations, avatars in virtual meeting rooms, intelligent computer-user interfaces such as E-mail reading and games, and many other operations. An example of an intelligent user interface is an E-mail tool on a personal computer which uses a talking head to express transmitted E-mail messages. The sender of the E-mail message could annotate the E-mail message by including emotional cues with or without text. Thus, a boss wishing to send a congratulatory E-mail message to a productive employee can transmit the message in the form of a happy face. Different emotions such as anger, sadness, or disappointment can also be emulated.
To achieve the desired effect, the animated head must be believable. That is, it must look real to the observer. Both the photographic aspect of the face (natural skin appearance, realistic shapes, absence of rendering artifacts) and the lifelike quality of the animation (realistic head and lip movements in synchrony with sound) must be perfect, because humans are extremely sensitive to the appearance and movement of a face.
Effective visual TTS can grab the attention of the observer, providing a personal user experience and a sense of realism to which the user can relate. Visual TTS using photorealistic talking heads, the subject of the present invention, has numerous benefits, including increased intelligibility over other methods such as cartoon animation, increased quality of the voice portion of the TTS system, and a more personal user interface.
Various approaches exist for realizing audio-visual TTS synthesis algorithms. Simple animation or cartoons are sometimes used. Generally, the more meticulously detailed the animation, the greater its impact on the observer. Nevertheless, because of their artificial look, cartoons have a limited effect. Another approach for realizing TTS methods involves the use of video recordings of a talking person. These recordings are integrated into a computer program. The video approach looks more realistic than the use of cartoons. However, the utility of the video approach is limited to situations where all of the spoken text is known in advance and where sufficient storage space exists in memory for the video clips. These situations simply do not exist in the context of the more commonly employed TTS applications.
Three-dimensional modeling can also be used for many TTS applications. These models provide considerable flexibility because they can be altered in any number of ways to accommodate the expression of different speech and emotions. Unfortunately, these models are usually not suitable for automatic realization by a computer. The complexities of three-dimensional modeling are ever-increasing as present models are continually enhanced to accommodate a greater degree of realism. Over the last twenty years, the number of polygons in state-of-the-art three-dimensional synthesized scenes has grown exponentially. Escalated memory requirements and increased computer processing times are unavoidable consequences of these enhancements. To make matters worse, synthetic scenes generated from the most modern three-dimensional modeling techniques often still have an artificial look.
With a view toward decreasing memory requirements and computation times while preserving realistic images in TTS methodologies, practitioners have implemented various sample-based photorealistic techniques. These approaches generally involve storing whole frames containing pictures of the subject, which are recalled in the necessary sequence to form the synthesis. While this technique is simple and fast, is too limited in versatility. That is, where the method relies on a limited number of stored frames to maintain compatibility with the finite memory capability of the computer being used, this approach cannot accommodate sufficient variations in head and facial characteristics to promote a believable photorealistic subject. The number of possible frames for this sample-based technique is consequently too limited to achieve a highly realistic appearance for most conventional computer applications.
FIG. 1
is a chart illustrating the various approaches used in TTS synthesis methodologies. The chart shows the tradeoff between realism and flexibility as a function of different approaches. The perfect model (block
130
) would have complete flexibility because it could accommodate any speech or emotional cues whether or not known in advance. Likewise, the perfect model would look completely realistic, just like a movie screen. Not surprisingly, there are no perfect models.
As can be seen, cartoons (block
100
) demonstrate the least amount of flexibility, since the cartoon frames are all predetermined, and as such, the speech to be tracked must be known in advance. Cartoons are also the most artificial, and hence the least realistic-looking. Movies (block
110
) or video sequences provide for a high degree of realism. However, like cartoons, movies have little flexibility since their frames depend upon a predetermined knowledge of the text to be spoken. The use of three-dimensional modeling (block
120
) is highly flexible, since it is fully synthetic and can accommodate any facial appearance and can be shown from any perspective (unlike models which rely on two dimensions). However, because of its synthetic nature, three-dimensional modeling still looks artificial and consequently scores lower on the realism axis.
Sample-based techniques (block
140
) represent the optimal tradeoff, with a substantial amount of realism and also some flexibility. These techniques look realistic because facial movements, shapes, and colors can be approximated with a high degree of accuracy and because video images of live subjects can be used to create the sample-based models. Sample based techniques are also flexible because a sufficient amount of samples can be taken to exchange head and facial parts to accommodate a wide variety of speech and emotions. By the same token, these techniques are not perfectly flexible because memory considerations and computation times must be taken into account, which places practical limits on the number of samples used (and hence the appearance of precision) in a given application.
To date, no animation technique exists for generating lifelike characters that could be automatically realized by a computer and that would be perceived by an observer as completely natural. Practitioners who have nevertheless sought to approximate such techniques have met with some success. Where practitioners employ a limited range of views and actions in a sample-based TTS synthesis (thereby minimizing memory requirements and computation times), photorealistic synthesis is coming within reach of today's technology. For example, the practitioner may implement a method which relies on frontal views of the head and shoulders, with limited head movements of 30 degree rotations and modest translations. While such a method has a limited versatility, often applications exist which do not require greater capability (e.g., some computer-user interface applications). Limited photorealistic synthesis methods can be a viable alternative for such applications.
Sample-based methods for generating photo-realistic characters are described in currently-pending patent applications entitled “Multi-Modal System For Locating Objects In Images”, Graf et al. U.S. patent application Ser. No. 08/752109, filed Nov. 20, 1996, and “Method For Generating Photo-realistic Animated Characters”, Graf et al. U.S. patent application Ser. No. 08/869531, filed Jun. 6, 1997

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Coarticulation method for audio-visual text-to-speech synthesis does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Coarticulation method for audio-visual text-to-speech synthesis, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Coarticulation method for audio-visual text-to-speech synthesis will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3183134

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.