Methods and devices for producing and using synthetic visual...

Data processing: speech signal processing, linguistics, language – Speech signal processing – Synthesis

Reexamination Certificate


Details

Classifications: C704S235000, C704S258000, C345S423000
Type: Reexamination Certificate
Status: active
Patent number: 06539354

ABSTRACT:

BACKGROUND OF THE INVENTION
This invention relates generally to computer generated synthetic visual speech, otherwise known as facial animation or lipsyncing. More specifically, this invention relates to methods and devices for generating synthetic visual speech based on coarticulation. This invention further relates to methods of using synthetic visual speech.
The natural production of human speech includes both auditory and visual components. The basic unit of sound for audible speech is the phoneme. Phonemes are the smallest units of speech capable of independent auditory recognition. Similarly, visual speech is made up of visemes. Visemes are the visual counterparts of phonemes. More specifically, a viseme is a visual speech representation defined by the external appearance of the articulators (i.e., lips, tongue, teeth, etc.) during articulation of a corresponding phoneme. Because many phonemes appear the same visually, more than one phoneme may be associated with a single viseme; phonemes therefore have a many-to-one relationship with visemes. Phonemes and visemes form the fundamental building blocks of visual speech synthesis.
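The many-to-one phoneme-to-viseme relationship can be illustrated with a small lookup table. The phoneme and viseme labels below are illustrative examples only, not taken from the patent:

```python
# Hypothetical many-to-one mapping from phonemes to visemes.
# Several phonemes that look identical on the lips share one viseme.
PHONEME_TO_VISEME = {
    "p": "bilabial",     # /p/, /b/, /m/ all close the lips the same way
    "b": "bilabial",
    "m": "bilabial",
    "f": "labiodental",  # /f/ and /v/ share a viseme
    "v": "labiodental",
}

def viseme_for(phoneme: str) -> str:
    """Return the viseme associated with a phoneme."""
    return PHONEME_TO_VISEME[phoneme]
```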
Several conventional lipsyncing systems are available which produce synthetic visual speech in a variety of different ways. For example, some of these systems use a binary (on/off) method to move between visemes. In the binary method, the image of a first viseme appears until it is switched abruptly to the image of a second viseme. In the binary approach, therefore, there is no transitioning between visemes: a viseme is either completely visible or completely invisible at any given time. When visually depicting a sound moving from an /o/ to a /t/, as in the word “hot,” for instance, the binary method displays the viseme corresponding to the /o/ until it abruptly changes to the viseme associated with the /t/. The result is very unrealistic, cartoon-like lipsyncing. An additional drawback of conventional binary systems is that they are generally limited to having only a few visemes to represent all of the possible sounds.
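The abrupt on/off switching described above can be sketched in a few lines. This is a minimal illustration of the binary approach, not code from any of the systems discussed:

```python
def binary_viseme_track(visemes, durations, frame_rate=30):
    """Binary (on/off) lipsync: each output frame shows exactly one
    viseme, switching abruptly at boundaries with no blending.

    visemes:   sequence of viseme labels, one per phoneme
    durations: seconds each viseme is held on screen
    """
    frames = []
    for viseme, duration in zip(visemes, durations):
        n_frames = max(1, round(duration * frame_rate))
        frames.extend([viseme] * n_frames)  # hard cut at each boundary
    return frames
```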
A better prior art approach to visual speech synthesis uses inbetweening (linear interpolation based morphing) to transition between visemes. Morphing is a common technique for driving a 3D animation in which key frames are used to define particular configurations of a 3D model at given points in time. Morphing specifically refers to the process of interpolating between defined key frames over time to gradually transform one shape into another. Conventional lipsyncing systems sometimes use inbetweening to approximate the contributions of multiple visemes to the overall appearance of the articulators at a given point in time during a viseme transition. These systems, therefore, transition between visemes more gradually by linearly combining the visemes during the transition period. Despite the improvements that inbetweening offers over binary systems, it is still fairly unrealistic and does not accurately account for the mechanics of real speech.
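A linear inbetween of two viseme key shapes can be sketched as follows. The flat lists of vertex coordinates are a simplification for illustration; real systems interpolate full 3D meshes:

```python
def inbetween(shape_a, shape_b, t):
    """Linear morph between two viseme key shapes.

    shape_a, shape_b: vertex coordinates of the two key frames
    t: interpolation parameter in [0, 1]; 0 yields shape_a, 1 yields shape_b
    """
    return [(1.0 - t) * a + t * b for a, b in zip(shape_a, shape_b)]
```

At t = 0.5 the articulators appear as an even blend of both visemes, which is exactly the linearity that fails to capture real coarticulation.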
A still more realistic prior art approach to the production of synthetic visual speech is parametric modeling. In parametric modeling, a specific, detailed 3D model has parameters associated with each of the parts of the face—most importantly, the articulators. The whole model is defined in terms of multiple parameters, and the position of every point on the 3D model is defined by an extensive formula. Systems using parametric modeling (such as the Baldi system developed at the University of California, Santa Cruz (UCSC)) have been better able to take into account contextual influences of natural visual speech production and are thereby able to produce more realistic-looking visual speech.
Unfortunately, however, parametric modeling requires the construction of a very complex graphical model. Consequently, a massive amount of work is required to create or modify these models. Also, because each of the parameters is defined in terms of a specific equation developed for that 3D model only, parametric modeling systems are 3D model dependent. These systems cannot be easily adapted for use with other 3D models. The difficulty of modifying the system to drive other 3D models makes parametric modeling rigid, complex, and expensive. Parametric modeling, therefore, does not offer a general purpose solution to the problem of providing realistic facial animation.
U.S. Pat. No. 5,657,426 (the '426 patent) to Waters et al. describes various methods of producing synchronized synthetic audio and visual speech which attempt to take into account factors influencing the natural production of human speech. The '426 patent attempts to account for these factors by interpolating between visemes using non-linear functions, such as cosine functions or equations based on Newtonian laws of motion.
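As one illustration of the general idea, a cosine-based transition function of the kind the '426 patent mentions might look like the sketch below. This is an assumed, generic cosine ease, not the patent's actual equations:

```python
import math

def cosine_interp(a, b, t):
    """Cosine-eased interpolation between values a and b for t in [0, 1].
    The transition starts and ends slowly, unlike a linear inbetween."""
    s = (1.0 - math.cos(math.pi * t)) / 2.0  # remap t onto a cosine curve
    return (1.0 - s) * a + s * b
```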
Other relevant prior art publications include Massaro, D. W., Beskow, J., Cohen, M. M., Fry, C. L., Rodriguez, T., “Picture My Voice: Audio to Visual Speech Synthesis using Artificial Neural Networks,” Proceedings of Auditory-Visual Speech Processing, Santa Cruz, Calif., August 1999; and Pelachaud, C., “Communication and Coarticulation in Facial Animation,” Doctoral Dissertation, University of Pennsylvania, 1991. An extensive collection of references to facial animation (lipsyncing) related articles, developments, and general information can be found at the University of California, Santa Cruz internet website: http://mambo.ucsc.edu/ps1/fan.html.
The “Picture My Voice” article by Massaro, D. W., et al. describes a synthetic visual speech production process that is worth mentioning briefly. Particularly, the article discloses use of a neural network to produce parameters to control a lipsyncing animation. This system has several drawbacks. Its primary drawback is that it relies on parametric modeling. Accordingly, it requires the use of a parameter estimator in which a single neural network converts the audio speech input features into control parameters for manipulating a specific parameterized 3D model. It is therefore model dependent. Furthermore, articulator position and movement in this system is fine-tuned for a specific speaker and is therefore also speaker dependent.
The industry has struggled to produce a general purpose solution to the problem of providing realistic computer-generated lipsyncing. Parametric modeling systems are 3D model dependent. Simpler, more adaptable prior art systems, on the other hand, fail to accurately account for the real-life parameters influencing human speech. What is needed, therefore, is a method and apparatus for generating realistic synthetic visual speech that is speaker, vocabulary, and model independent, and that accurately accounts for factors of natural human speech production without undue processing requirements. The industry is also in need of applications that take advantage of general purpose synthetic visual speech generation.
SUMMARY OF THE INVENTION
This invention provides a significant improvement in the art by enabling a method and apparatus for producing synthetic visual speech. The method of producing synthetic visual speech according to this invention includes receiving an input containing speech information. One or more visemes that correspond to the speech input are then identified. Next, the weights of those visemes are calculated using a coarticulation routine. The coarticulation routine includes viseme deformability information and calculates viseme weights based on a variety of factors including phoneme duration and speech context. A synthetic visual speech output is produced based on the visemes' weights over time (or viseme tracks). Producing the synthetic visual speech output can include retrieving a three-dimensional (3D) model (target model) for each of the visemes and morphing between selected target models based on their weights.
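The final morphing step described above can be sketched as a weighted blend of viseme target shapes against a neutral base, in the style of morph-target animation. This is a minimal illustration of the general technique, not the patent's actual implementation; the flat coordinate lists and function names are assumptions:

```python
def blend_visemes(neutral, targets, weights):
    """Blend viseme target shapes by weight at one instant on the
    viseme tracks (morph-target style).

    neutral: base (rest-pose) vertex positions
    targets: mapping of viseme name -> target vertex positions
    weights: mapping of viseme name -> weight, e.g. from a
             coarticulation routine; absent visemes weigh 0
    """
    out = list(neutral)
    for name, target in targets.items():
        w = weights.get(name, 0.0)
        for i, (t, n) in enumerate(zip(target, neutral)):
            out[i] += w * (t - n)  # add weighted displacement from neutral
    return out
```

Evaluating this blend once per frame, with weights that rise and fall along each viseme track, produces the gradual, context-dependent transitions the summary describes.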
Several general processes are possible based on the synthetic visual speech production method of the present invention. One such process converts separate voice a
