Film language

Optics: motion pictures – With sound accompaniment – Picture and sound synchronizing

Reexamination Certificate


Details

US Classification: C352S005000, C352S023000, C352S025000
Type: Reexamination Certificate
Status: active
Patent number: 06778252

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to cinematic works, and particularly to altered cinematic works in which the facial motion and vocal-tract dynamics of a voice-dub speaker are used to animate the facial and lip motion of a screen actor, so that the sound track of a motion picture can be replaced with a new sound track in a different language.
2. Background of the Invention
In the field of audio dubbing, there are many cinematic and television works for which a language-translation dub of the original cinematic or dramatic work is desirable, with the originally recorded voice track replaced by a new voice track. In one case, it is desirable to re-record a screen actor's speech and dub it onto the original visual track. In another, it is desirable for the audio dub to be exactly synchronized with the facial and lip motions of the original on-screen speaker. Rather than re-shooting the actor speaking the scene, the dubbing process provides an opportunity to change the voice.
Prior approaches to synchronizing lip and mouth motion with new voice sound tracks have largely been manual processes using computer audio and graphics processing and special-effects tools. There have been recent developments toward automating the voice-dubbing process using 2D techniques to modify archival footage (Bregler), applying computer vision and audio speech recognition techniques to identify, analyze and capture the visual motions associated with specific speech utterances. Prior approaches have concentrated on concatenation-based synthesis of new visuals to synchronize with new voice-dub tracks from the same or other actors, in the same or other languages. This approach analyzes the screen actor's speech, converts it into triphones and/or phonemes, and then uses the time-coded phoneme stream to identify the corresponding visual facial motions of the jaw, lips, visible tongue and visible teeth. Single-frame snapshots or multi-frame clips of the facial motion corresponding to speech phoneme utterance states and transformations are stored in a database, which is subsequently used to animate the original screen actor's face, synchronized to a new voice track that has been converted into a time-coded, image-frame-indexed phoneme stream.
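As a rough illustration of this prior concatenation-based lookup, the sketch below maps a time-coded phoneme stream onto stored viseme snapshots to produce key frames for animation. The Viseme class, phoneme labels, and database structure are illustrative assumptions, not the data structures of any cited system.

```python
# Minimal sketch of a concatenation-based viseme lookup (assumed structures).
from dataclasses import dataclass

@dataclass
class Viseme:
    phoneme: str   # phoneme label, e.g. "AA", "M", "F" (illustrative)
    frames: list   # stored snapshot or short clip for this mouth shape

# Hypothetical database mapping phoneme labels to stored facial images/clips.
viseme_db: dict[str, Viseme] = {}

def visemes_for_track(phoneme_stream: list[tuple[float, str]]) -> list[tuple[float, Viseme]]:
    """Map a time-coded phoneme stream (start_time, phoneme) from the new
    voice track onto stored visemes, producing key frames to interpolate."""
    keyframes = []
    for start_time, phoneme in phoneme_stream:
        viseme = viseme_db.get(phoneme)
        if viseme is not None:
            keyframes.append((start_time, viseme))
    return keyframes
```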
Concatenation-based synthesis relies on acquiring samples of the variety of unique facial expression states corresponding to pure or mixed phonemes, as triphones or diphones. The snapshot image states or short clip image sequences are used as key-frame facial speech motion image sets, and intermediate frames between key frames are interpolated using optical morph techniques. The technique is limited in that it is essentially a symbol system: it uses atomic speech and facial motion states to synthesize continuous facial animation by identifying the facial motions and interpolating between key frames of facial motion. The actual transformation paths from a first viseme state to a second viseme state are estimated using short clips, or by hand, frame to frame, or by standard morph animation techniques using various curve functions to smooth the concatenation process.
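The sketch below shows the simplest possible interpolation between two viseme key frames, a plain cross-dissolve. It is a stand-in for the optical morph techniques mentioned above (a true morph would also warp feature geometry along the transformation path), included only to make the key-frame interpolation step concrete.

```python
import numpy as np

def interpolate_keyframes(frame_a: np.ndarray, frame_b: np.ndarray,
                          n_intermediate: int) -> list[np.ndarray]:
    """Generate intermediate frames between two viseme key frames by linear
    cross-dissolve; a placeholder for a full optical morph."""
    frames = []
    for i in range(1, n_intermediate + 1):
        t = i / (n_intermediate + 1)            # blend parameter in (0, 1)
        frames.append((1.0 - t) * frame_a + t * frame_b)
    return frames
```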
BRIEF SUMMARY OF THE INVENTION
The invention comprises a method for accumulating an accurate database of learned motion paths of a speaker's face and mouth during speech, and for applying that database to direct facial animation during speech using visemes.
Visemes are collected either from existing legacy material or, when actors are accessible, by generating reference audio-video “footage”. When the screen actor is available, the actor speaks a script that elicits all the required phonemes and co-articulations, as would commonly be established with the assistance of a trained linguist. This script is composed for each language or actor on a case-by-case basis, and attempts to elicit all needed facial-expressive phonemes and co-articulation points.
The sentences of the script first elicit speech-as-audio, to represent mouth shapes for each spoken phoneme as a position of the mouth and face. The sentences then elicit speech-as-motion, to derive the requisite range of facial-expressive transformations. Such facial transformations include those effected by (1) speaking words, to capture the facial motion paths of the variety of diphones and triphones needed to represent new speech facial motions, and (2) making emotional facial gestures. Common phoneme-range elicitation scripts exist as alternate sets of sentences used to elicit all the basic phonemes, such as the “Rainbow Passage”. Eliciting all types of transformation between one phoneme and another requires using diphones, the sound segments that form the transition between one phoneme and another, for all the phonemes. As Bregler confirmed in U.S. Pat. No. 5,880,788, triphones can produce many thousands of different transformations from one phoneme sound and corresponding mouth shape to another. Triphones are used to elicit and capture the visual facial and mouth-shape transformations from one speech phoneme mouth position dynamically to another, and to capture the most important co-articulation facial motion paths that occur during speech.
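One way to reason about whether a script elicits the needed transitions is a simple coverage check over phoneme pairs. The sketch below is an assumption-laden illustration: the phoneme inventory is a toy example, and a real elicitation script would be composed with a linguist for the target language.

```python
from itertools import product

# Toy phoneme inventory (illustrative only).
PHONEMES = ["AA", "IY", "UW", "M", "F", "S"]

def missing_diphones(script_phonemes: list[str]) -> set[tuple[str, str]]:
    """Return phoneme-to-phoneme transitions not yet covered by the script,
    so additional elicitation sentences can be composed."""
    needed = {(a, b) for a, b in product(PHONEMES, repeat=2) if a != b}
    covered = set(zip(script_phonemes, script_phonemes[1:]))
    return needed - covered
```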
The actual motion path of a set of fixed reference points, as the mouth moves from one phoneme to another, is recorded and captured for the entire transformation between any pair of different phonemes. As the mouth naturally speaks one phoneme and then alters its shape to speak another, the entire group of fixed reference points moves along a particular relative course during any phoneme-to-phoneme transformation. Triphones capture the requisite variety of facial and mouth motion paths, and many examples of these different phoneme-to-phoneme mouth-shape transformations are captured from a speaking actor.
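A minimal sketch of what such a capture record might look like follows: per-frame 2D positions of the tracked reference points, stored under the phoneme pair they span. The array layout and the `path_db` dictionary are assumptions made for illustration.

```python
import numpy as np

def capture_transition_path(tracked_frames: list[np.ndarray],
                            from_phoneme: str, to_phoneme: str,
                            path_db: dict) -> None:
    """Store one observed motion path for a (from, to) phoneme pair.
    Each element of tracked_frames holds the (n_points, 2) positions of the
    fixed reference points in one frame."""
    path = np.stack(tracked_frames)            # shape (n_frames, n_points, 2)
    path_db.setdefault((from_phoneme, to_phoneme), []).append(path)
```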
There are two sources of capture: target footage and reference footage. The target footage is the audio-visual sequence to be altered. Elicitation examples are selected to accommodate the phoneme set of the language of the target footage, creating a database. This database of recorded mouth motions is used as a training base for a computer vision motion tracking system, such as the eigen-images approach described in Pentland et al., “View-Based and Modular Eigenspaces for Face Recognition”, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 1994, pp. 84-91. The computer vision system analyzes the training footage, practicing its vision analysis on the visual footage to improve fixed-reference-point tracking and identification for each frame. A speech recognition system or a linguist is used to recognize phonemes in the training footage, which serve as index points for selecting frames for use as visemes.
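In the spirit of the cited eigen-images approach, the sketch below builds a small PCA eigenspace over vectorized mouth-region patches and projects new patches into it. It is a minimal stand-in, not the cited paper's full view-based and modular system; the patch extraction step is assumed to happen elsewhere.

```python
import numpy as np

def build_eigenspace(patches: np.ndarray, n_components: int):
    """Minimal PCA over vectorized image patches, shape (n_samples, n_pixels).
    Returns the mean patch and the top principal components (eigen-images)."""
    mean = patches.mean(axis=0)
    centered = patches - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(patch: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project a new patch into the eigenspace for comparison against training data."""
    return components @ (patch - mean)
```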
If the training footage is of a screen actor, it permits a computer vision motion tracking system to learn the probable optical flow paths for each fixed reference point on the face for the different types of mouth motion corresponding to phoneme transitions. These optical reference-point flow paths during speech facial motions are recorded and averaged over numerous captured examples of the same motion transformation.
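The averaging step could look like the sketch below: each captured example of the same transformation is resampled to a common number of time steps and then averaged pointwise. The linear resampling scheme is an assumption made for illustration, not taken from the text.

```python
import numpy as np

def average_paths(examples: list[np.ndarray], n_samples: int = 30) -> np.ndarray:
    """Average several captured motion paths of one phoneme-to-phoneme
    transformation. Each example has shape (n_frames, n_points, 2); all are
    resampled to n_samples time steps before averaging."""
    resampled = []
    for path in examples:
        t_old = np.linspace(0.0, 1.0, path.shape[0])
        t_new = np.linspace(0.0, 1.0, n_samples)
        flat = path.reshape(path.shape[0], -1)
        # Interpolate every coordinate of every reference point onto t_new.
        resampled.append(np.stack([np.interp(t_new, t_old, flat[:, k])
                                   for k in range(flat.shape[1])], axis=1))
    return np.mean(resampled, axis=0).reshape(n_samples, *examples[0].shape[1:])
```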
A database of recorded triphone and diphone mouth-transformation optical flow path groups is accumulated. For commonly used speech transformations, spline-based curve-fitting techniques are applied to estimate and closely match the recorded relative spatial paths and the rates of relative motion during different transformations. The estimated motion path for any reference point on the face, in conjunction with all the other reference points and their rates of relative motion change during any mouth-shape transformation, is saved and indexed for later production usage.
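As one possible concrete form of this step, the sketch below fits a cubic spline to an averaged motion path and keys it by phoneme pair so it can be replayed at any frame rate during production. The choice of `CubicSpline` and the index structure are assumptions; the text does not specify the particular spline technique.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def fit_transition_spline(avg_path: np.ndarray) -> CubicSpline:
    """Fit a spline to an averaged motion path of shape (n_samples, n_points, 2),
    parameterized over normalized time in [0, 1]."""
    t = np.linspace(0.0, 1.0, avg_path.shape[0])
    return CubicSpline(t, avg_path.reshape(avg_path.shape[0], -1))

# Hypothetical production-side index: (from_phoneme, to_phoneme) -> fitted spline.
transition_index: dict[tuple[str, str], CubicSpline] = {}

def evaluate_transition(spline: CubicSpline, t: float, n_points: int) -> np.ndarray:
    """Evaluate reference-point positions at normalized time t in [0, 1]."""
    return spline(t).reshape(n_points, 2)
```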
An emotional capture elicitation process is effected by having the actor get in the mood of
