Reexamination Certificate
1999-03-11
2002-09-10
Chawan, Vijay B (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S258000, C704S249000, C704S238000, C434S185000, C434S169000, C434S118000, C348S515000
Reexamination Certificate
active
06449595
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to audiovisual systems and, more particularly, to a system and methodology for face synthesis.
BACKGROUND OF THE INVENTION
Recently there has been significant interest in face synthesis. Face synthesis refers to the generation of a facial image in accordance with a speech signal, so that it appears to a viewer that the facial image is speaking the words uttered in the speech signal. There are many applications of face synthesis including film dubbing, cartoon character animation, interactive agents, and multimedia entertainment.
Face synthesis generally involves a database of facial images in correspondence with the distinct sounds of a language. Each distinct sound of the language is referred to as a “phoneme,” and during pronunciation of a phoneme, the mouth and lips of a face form a characteristic, visible configuration, referred to as a “viseme.” Typically, the facial image database includes a “codebook” that maps each phoneme of the language to a corresponding viseme. Accordingly, the input speech is segmented into phonemes, and the corresponding viseme for each phoneme is fetched in sequence from the database and displayed, as sketched below.
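As a rough illustration (not taken from the patent), the following Python sketch shows a toy codebook lookup of this kind; the phoneme symbols, image file names, and word-to-phoneme table are hypothetical placeholders standing in for a real text-to-phoneme front end and viseme database.

```python
# Toy sketch of the conventional phoneme-to-viseme codebook lookup.
# All symbols and image names below are hypothetical placeholders.

CODEBOOK = {
    "M":  "viseme_closed_lips.png",  # bilabial closure
    "AA": "viseme_open_jaw.png",     # open-mouth vowel
    "P":  "viseme_closed_lips.png",  # another bilabial, sharing the same image
    # ... one entry per phoneme of the language
}

# Toy pronunciation lexicon standing in for a real text-to-phoneme front end.
LEXICON = {"ma": ["M", "AA"], "pa": ["P", "AA"]}

def render_sequence(word: str) -> list[str]:
    """Fetch the viseme image for each phoneme of the word, in order."""
    return [CODEBOOK[p] for p in LEXICON[word]]

print(render_sequence("ma"))  # ['viseme_closed_lips.png', 'viseme_open_jaw.png']
```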
Realistic image quality is an important concern in face synthesis, and transitions from one sound to the next are particularly difficult to implement in a life-like manner because the mouth and lips are moving during the course of pronouncing a sound. In one approach, mathematical routines are employed to interpolate a series of intermediate images between the viseme for one phoneme and the viseme for the next. Such an approach, however, can result in an unnatural or distorted appearance, because the movements from one mouth and lip configuration to another are often non-linear.
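A minimal sketch of such interpolation, assuming visemes are represented as numeric feature vectors (for example, mouth and lip coordinates), is shown below; the straight-line blend is precisely what can look unnatural when the true articulator motion is non-linear.

```python
import numpy as np

def interpolate_visemes(v_start: np.ndarray, v_end: np.ndarray, n_frames: int) -> np.ndarray:
    """Linearly interpolate intermediate viseme feature vectors.

    Returns an (n_frames, d) array blending v_start into v_end.
    Because real mouth/lip trajectories are non-linear, this simple
    blend is where the unnatural appearance noted above can arise.
    """
    alphas = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1.0 - alphas) * v_start + alphas * v_end

# Example: five frames blending a closed-lip shape into an open-jaw shape.
frames = interpolate_visemes(np.array([0.0, 0.0]), np.array([1.0, 0.5]), 5)
```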
In general, it is practical to store only a restricted number of phoneme/viseme entries in the codebook, even though several factors call for more entries. For example, image quality may be improved by storing visemes for all the allophones of a phoneme, where an allophone is a slight, non-contrastive variation in the pronunciation of the phoneme. A similar issue arises when a face synthesis system originally developed for one language is applied to speech in another language, because the other language may include phonemes lacking in the original language. Furthermore, the precise shape of a viseme often depends on the neighboring visemes, and there has been some interest in using sequences of phonemes of a given length, such as diphones.
Augmenting the codebook with visemes for every possible allophone, foreign phoneme, and phoneme sequence consumes an unacceptably large amount of storage. In a common approach, aliasing techniques are employed in which the visemes for a missing phoneme or phoneme sequence are replaced by existing visemes in the codebook. Aliasing, however, tends to introduce artifacts at the frame boundaries, thereby reducing the realism of the final image.
SUMMARY OF THE INVENTION
Accordingly, there exists a need for a face synthesis system and methodology that generates realistic facial images. In particular, there is a need for handling transitions from one viseme to the next with improved realism. Furthermore, a need exists for generating realistic facial images for sequences of phonemes that are missing from the codebook or for foreign-language phonemes.
These and other needs are addressed by a method and computer-readable medium bearing instructions for synthesizing a facial image, in which a speech frame from an incoming speech signal is compared against acoustic features stored within an audio-visual codebook to produce a set of weights. These weights are used to generate a composite visual feature based on visual features corresponding to the acoustic features, and the composite visual feature is then used to synthesize a facial image. Generating a facial image based on a weighted composition of other images is a flexible approach that allows for more realistic facial images.
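As a rough illustration of this weighted-composition idea, the Python sketch below compares a speech frame's acoustic feature against each codebook entry and blends the corresponding visual features. The Euclidean distance measure, the softmax-style weighting, and the temperature parameter are illustrative assumptions; the summary above requires only that the comparison yield a set of weights used to form a composite visual feature.

```python
import numpy as np

def composite_visual_feature(speech_frame_feat: np.ndarray,
                             acoustic_feats: np.ndarray,
                             visual_feats: np.ndarray,
                             temperature: float = 1.0) -> np.ndarray:
    """Sketch of forming a composite visual feature from codebook entries.

    speech_frame_feat: (d_a,)   acoustic feature of the incoming speech frame
    acoustic_feats:    (N, d_a) acoustic feature stored for each codebook entry
    visual_feats:      (N, d_v) visual feature stored for each codebook entry

    The distance measure and softmax weighting are illustrative assumptions,
    not details specified by the patent.
    """
    # Compare the speech frame against every entry's acoustic feature.
    dists = np.linalg.norm(acoustic_feats - speech_frame_feat, axis=1)
    # Convert distances into normalized weights (closer entries weigh more).
    weights = np.exp(-dists / temperature)
    weights /= weights.sum()
    # Composite visual feature = weighted sum of the entries' visual features.
    return weights @ visual_feats
```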
For example, more realistic viseme transitions may be realized by storing, for each entry in the audio-visual codebook, multiple samples of the acoustic and visual features taken during the course of pronouncing a sound. Visemes for foreign phonemes can be generated by combining visemes from a combination of audio-visual codebook entries that correspond to native phonemes. For context-sensitive audio-visual codebooks with a restricted number of phoneme sequences, a weighted combination of features from visually similar phoneme sequences allows a realistic facial image to be produced for a missing phoneme sequence.
In one embodiment, both of the aforementioned aspects are combined so that each entry in the audio-visual codebook corresponds to a phoneme sequence and includes multiple samples of acoustic and visual features. In some embodiments, the acoustic features may be implemented as a set of line spectral frequencies and the visual features as the principal components of a Karhunen-Loève transform of face points.
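The sketch below illustrates one way visual features of this kind could be derived: principal components of tracked face points obtained via an eigen-decomposition of their covariance (a standard Karhunen-Loève/PCA computation). The face-point representation, the number of retained components, and the function name are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def face_point_principal_components(face_points: np.ndarray, k: int):
    """Derive visual features as principal components of tracked face points.

    face_points: (num_samples, num_points * 2) flattened 2-D point coordinates.
    Returns the mean shape, the top-k eigenvectors, and each sample's
    k-dimensional projection. Face-point tracking and the choice of k are
    outside the scope of this sketch.
    """
    mean = face_points.mean(axis=0)
    centered = face_points - mean
    # Eigen-decomposition of the covariance matrix (classic Karhunen-Loève/PCA).
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]
    basis = eigvecs[:, order]      # (dim, k) principal directions
    coeffs = centered @ basis      # (num_samples, k) visual features
    return mean, basis, coeffs
```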
Additional objects, advantages, and novel features of the present invention will be set forth in part in the description that follows, and in part, will become apparent upon examination or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
REFERENCES:
patent: 4569026 (1986-02-01), Best
patent: 4884972 (1989-12-01), Gasper
patent: 4907276 (1990-03-01), Aldersberg
patent: 5608839 (1997-03-01), Chen
patent: 5630017 (1997-05-01), Gasper et al.
patent: 5657426 (1997-08-01), Waters et al.
patent: 5826234 (1998-10-01), Lyberg
patent: 5878396 (1999-03-01), Henton
patent: 5880788 (1999-03-01), Bregler
patent: 5884267 (1999-03-01), Goldenthal et al.
patent: 6112177 (2000-08-01), Cossatto et al.
patent: 0 689 362 (1995-12-01), None
patent: 0 710 929 (1996-05-01), None
patent: 2 231 246 (1990-11-01), None
patent: WO 97/36288 (1997-10-01), None
Chou et al., “Speech recognition for image animation and coding,” 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), May 1995, vol. 4, pp. 2253-2256.*
Williams et al., “Frame rate and viseme analysis for multimedia applications,” IEEE First Workshop on Multimedia Signal Processing, Jun. 1997, pp. 23-25.*
Gao et al., “Synthesis of facial images with lip motion from several real views,” Third International Conference on Automatic Face and Gesture Recognition Proceedings, Apr. 1998, pp. 181-186.*
Ostermann, “Animation of synthetic faces in MPEG-4,” Proceedings Computer Animation '98, Jun. 1998, pp. 49-55.*
McAllister et al., “Automated lip-sync animation as a telecommunications aid for the hearing impaired,” Proceedings 1998 IEEE 4th Workshop on Interactive Voice Technology for Telecommunications Applications, Sep. 1998, pp. 112-117.*
Olives et al., “Towards a high quality Finnish talking head,” 1999 IEEE 3rd Workshop on Multimedia Signal Processing, Sep. 1999, pp. 433-437.*
Ezzat et al., “MikeTalk: a talking facial display based on morphing visemes,” Proceedings Computer Animation '98, Jun. 1998, pp. 96-102.*
Yang et al., “Automatic selection of visemes for image-based visual speech synthesis,” 2000 IEEE International Conference on Multimedia and Expo, Aug. 2000, vol. 2, pp. 1081-1084.*
Faruquie et al., “Large vocabulary audio-visual speech recognition using active shape models,” Proceedings 15th International Conference on Pattern Recognition, Sep. 2000, pp. 106-109.*
Sirovich, L. et al., “Low-Dimensional Procedure for the Characterization of Human Faces,” Journal of the Optical Society of America, vol. 4, No. 3, pp. 519-524 (Mar. 1, 1987).
Arslan Levent Mustafa
Talkin David Thieme
Chawan Vijay B
Magee Theodore M.
Microsoft Corporation
Westman Champlin & Kelly P.A.