Talking facial display method and apparatus

Education and demonstration – Language – Speech

Reexamination Certificate


Details

C434S167000, C434S169000, C345S473000


active

06250928

ABSTRACT:

BACKGROUND OF THE INVENTION
There has been increased interest recently in the development of text-to-audio-visual speech synthesis (TTAVS) systems, in which standard text-to-speech (TTS) synthesizers are augmented with a visual component, taking the form of an image of a talking face. This interest is driven by the possible deployment of such systems as visual desktop agents, digital actors, and virtual avatars. In addition, TTAVS systems may have potential uses in very low bandwidth video conferencing and special effects, and would also be of interest to psychologists who wish to study visual speech production and perception.
An important aspect which might be desired of these facial TTAVS systems is video realism: the ability of the final audio-visual output to look and sound exactly as if it were produced by a real human face that was recorded by a video camera.
Unfortunately, much of the recent work in this field falls short of producing the impression of video realism. The reason for this, the inventors believe, is that most current TTAVS systems have chosen to integrate 3D graphics-based facial models with the audio speech synthesis. See M. M. Cohen and D. W. Massaro, "Modeling coarticulation in synthetic visual speech," in Models and Techniques in Computer Animation, pages 139-156, N. M. Thalmann and D. Thalmann, editors, Springer-Verlag, Tokyo, 1993. See also B. LeGoff and C. Benoit, "A text-to-audio-visual speech synthesizer for French," in Proceedings of the International Conference on Spoken Language Processing (ICSLP), Philadelphia, USA, October 1996. Although it is possible to improve visual realism through texture-mapping techniques, there appears to be an inherent difficulty in modeling both the complex visual appearance of a human face and the underlying facial mouth movement dynamics using 3D graphics-based methods.
Besides the underlying facial mouth movement dynamics problem, there is difficulty in constructing a visual speech stream: it is not sufficient to simply display the viseme images in sequence. Doing so would create the disturbing illusion of very abrupt mouth movement, since the viseme images differ significantly from each other in shape. Consequently, a mechanism of transitioning from each viseme image to every other viseme image is needed, and this transition must be smooth and realistic. This need prompted the study of what is known as morphing, a technique adopted to create smooth and realistic viseme transitions.
Morphing was first popularized by Beier & Neely in the context of generating transitions between different faces for Michael Jackson's Black or White music video. See T. Beier and S. Neely, "Feature-based Image Metamorphosis," in SIGGRAPH '92 Proceedings, pages 35-42, Chicago, Ill., 1992. The transformation between images occurs as a warp of the first image into the second, a similar inverse warp of the second image into the first, and a final cross-dissolve or blend of the warped images. It should be noted that those involved in the early studies noticed the viability of using morphing as a method of transitioning between various facial pose, expression, and mouth position imagery.
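The warp-and-blend transformation described above can be pictured in a few lines of code. The following is a minimal illustrative sketch, not the patent's implementation: it assumes a precomputed dense flow field of per-pixel displacements and uses nearest-neighbor sampling for brevity.

```python
import numpy as np

def warp(image, flow, t):
    """Warp `image` a fraction t of the way along a dense flow field.
    flow has shape (H, W, 2), holding per-pixel (dx, dy) displacements."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbor backward sampling, clamped to the image borders.
    src_x = np.clip(np.round(xs + t * flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.round(ys + t * flow[..., 1]), 0, h - 1).astype(int)
    return image[src_y, src_x]

def morph(img_a, img_b, flow_ab, t):
    """Warp A forward, warp B by the inverse flow, then cross-dissolve.
    t runs from 0.0 (pure A) to 1.0 (pure B)."""
    warped_a = warp(img_a, flow_ab, t)
    warped_b = warp(img_b, -flow_ab, 1.0 - t)
    return ((1.0 - t) * warped_a + t * warped_b).astype(img_a.dtype)
```

At t = 0 the result is the first image, at t = 1 the second, and intermediate values of t yield the smooth in-between frames that the viseme transitions require.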
The difficulty with traditional morphing approaches is that the specification of the warp between the images requires the definition of a set of high-level features. These features serve to ensure that the warping process preserves the desired correspondence between the geometric attributes of the objects to be morphed. For example, if one were morphing between two faces, one would want the eyes in one face to map to the eyes in the other face, the mouth in one face to map to the mouth in the other face, and so on. Consequently, the correspondence between these eyes and mouth features would need to be specified.
When morphing/warping is done by hand, however, this feature specification process can become quite tedious and complicated, especially in cases where a large amount of imagery is involved. In addition, the process of specifying the feature regions usually requires hand-coding a large number of ad-hoc geometric primitives, such as line segments, corner points, arches, circles, and meshes. Beier & Neely, in fact, state explicitly that specifying the correspondence between images constitutes the most time-consuming aspect of the morph. Therefore, there is a need to automate and improve this traditional method of morphing as it is utilized in making a photo-realistic talking facial display.
SUMMARY OF THE INVENTION
The current invention alleviates the problem of producing the impression of a photo-realistic talking face by starting with video imagery of a human-subject rather than a computer-generated 3D model, and applying techniques to make the human-subject appear photo-realistic when synchronized with input text. In addition, the time-consuming feature specification of previous morphing techniques has been eliminated through the use of optical flow methods implemented in the current invention.
The present invention provides a method and apparatus of converting input text into an audio-visual speech stream resulting in a talking face image enunciating the text. The audio-visual speech stream contains phoneme and timing information. The talking face image is built using visemes, where these visemes are defined by a set of images spanning a large range of mouth shapes derived from a recorded visual corpus of a human-subject. The present invention method of converting input text into an audio-visual speech stream comprises the steps of (i) recording a visual corpus of a human-subject, (ii) building a viseme interpolation database, (iii) and synchronizing the talking face image with the text stream. The database is filled with a subset of visemes from recorded visual corpus and at least one set of interpolation vectors that define a transition from each viseme image to every other viseme image.
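The viseme interpolation database described above might be organized as follows. The names and types in this sketch are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class VisemeDatabase:
    """Illustrative container: viseme images plus pairwise transition vectors."""
    visemes: dict = field(default_factory=dict)      # viseme id -> mouth image
    transitions: dict = field(default_factory=dict)  # (id_a, id_b) -> interpolation vectors

def build_viseme_database(labeled_frames):
    """Keep one representative frame per viseme label from the recorded corpus."""
    db = VisemeDatabase()
    for label, frame in labeled_frames:
        db.visemes.setdefault(label, frame)
    return db
```

The `transitions` table corresponds to the patent's interpolation vectors: one entry per ordered pair of visemes, so that a transition from each viseme image to every other viseme image is available at synthesis time.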
In a preferred embodiment, the transitions are automatically calculated using optical flow methods and morphing techniques are employed to result in smooth viseme transitions. The viseme transitions are concatenated together and synchronized with the phonemes according to the timing information. The audio-visual speech stream is then displayed in real time, thereby displaying a photo-realistic talking face.
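Synchronization with the phoneme timing information can be illustrated by expanding a timed phoneme stream into a per-frame viseme schedule. This is a hypothetical sketch; the function name and the fixed frame rate are chosen for demonstration only:

```python
def schedule_visemes(phoneme_stream, fps=30):
    """Expand a timed phoneme stream into a per-frame viseme schedule.
    phoneme_stream: list of (viseme_id, duration_in_seconds) pairs."""
    schedule = []
    for viseme_id, duration in phoneme_stream:
        # Hold each viseme for a number of frames proportional to its duration.
        n_frames = max(1, round(duration * fps))
        schedule.extend([viseme_id] * n_frames)
    return schedule
```

Each consecutive pair in the resulting schedule then selects which precomputed viseme transition to play, so the concatenated transitions stay aligned with the audio.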
In another embodiment of the present invention, the human-subject enunciates a set of key words, where the set of key words is specifically designed to elicit at least one instantiation of each viseme. The step of enunciating the key words comprises enunciating between 40 and about 50 words from the English language. In a further embodiment of the present invention, recording a visual corpus of a human-subject results in an optical recording of a three dimensional image of the human-subject, where the three dimensional image recording has a plurality of three dimensional image properties capable of being altered. Three dimensional image properties are selected from a group consisting of lighting, shadowing, depth of field, focus, sharpness, color balance, grey scale, saturation, brightness, field of view, and cropping.
In a preferred embodiment of the invention method, building a viseme interpolation database comprises the steps of (i) identifying each viseme as corresponding to a phoneme and (ii) extracting a plurality of visemes from the visual corpus. Identifying each viseme comprises the steps of searching through said recording and relating each viseme on each recorded frame of the recording to a phoneme. In an embodiment of the present invention, the steps of searching and relating are performed manually. Relating each viseme comprises the steps of subjectively rating each viseme and phoneme combination and selecting a final set of visemes from among said rated viseme and phoneme combinations. The invention method further comprises the step of attaching attributes to each viseme, where the attributes define characteristics of the human-subject. Characteristics of the human-subject
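The phoneme-to-viseme identification step can be pictured as a many-to-one mapping, since several phonemes share the same mouth shape. The grouping below is hypothetical and chosen only for illustration; the patent selects the final viseme set by subjectively rating viseme and phoneme combinations:

```python
# Hypothetical many-to-one phoneme-to-viseme grouping for illustration;
# the patent's actual set is chosen by subjective rating of recorded frames.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "ae": "open",
}

def viseme_for(phoneme):
    """Look up the viseme class for a phoneme, defaulting to a neutral mouth."""
    return PHONEME_TO_VISEME.get(phoneme.lower(), "neutral")
```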
