Title: Translingual visual speech synthesis
Patent number: 06813607
Type: Reexamination Certificate (active)
Filed: 2000-01-31
Issued: 2004-11-02
Examiner: To, Doris H. (Department: 2655)
Classification: Data processing: speech signal processing, linguistics, language; Speech signal processing; Application
U.S. Classes: C704S258000, C704S260000, C704S270000
ABSTRACT:
DESCRIPTION
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to visual speech synthesis and, more particularly, to a method of implementing an audio-driven facial animation system in any language using a speech recognition system and visemes of a different language.
2. Background Description
Audio-driven facial animation is an interesting and evolving technique in the field of human-computer interaction. The realization of a natural and friendly interface is very important in human-computer interaction. Speech recognition and computer lip-reading have been developed as means of input for information interaction with the machine. It is equally important to provide a natural and friendly means of rendering information. Visual speech synthesis is very important in this respect, as it can provide various kinds of animated computer agents that look very realistic. Furthermore, it can be used in distance-learning applications, where it can obviate the transmission of video. It can also be a useful tool for hearing-impaired people, compensating for the lack of auditory information.
Techniques exist for synthesizing speech given text as input to the system. These text-to-speech synthesizers work by producing a phonetic alignment of the text to be pronounced and then generating smooth transitions between the corresponding phones to obtain the desired sentence. See R. E. Donovan and E. M. Eide, “The IBM Trainable Speech Synthesis System”, International Conference on Speech and Language Processing, 1998. Recent work in bimodal speech recognition uses the fact that the audio and corresponding video signals have dependencies which can be exploited to improve speech recognition accuracy. See T. Chen and R. R. Rao, “Audio-Visual Integration in Multimodal Communication”, Proceedings of the IEEE, vol. 86, no. 5, May 1998, pp. 837-852, and E. D. Petajan, B. Bischoff, D. Bodoff, and N. M. Brooke, “An Improved Automatic Lipreading System to Enhance Speech Recognition”, Proc. CHI, 1988, pp. 19-25. A viseme-to-phoneme mapping is required to convert a score from the video space to the audio space. Using such a mapping and text-to-speech synthesis, a text-to-video synthesizer can be built. This synthesis, or facial animation, can be driven by text or by speech audio, as the application may require. In the latter case, the phonetic alignment is generated from the audio with the help of the true word string representing the spoken words.
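The phoneme/viseme correspondence mentioned above is a many-to-one mapping from phone labels to mouth-shape classes. The following is a minimal Python sketch of how a timed phonetic alignment might be converted into a visemic alignment for synthesis; the table entries, phone set, and helper names are illustrative assumptions, not the mapping used in the patent.

```python
# Illustrative sketch: converting a timed phonetic alignment into a visemic one
# via a many-to-one phoneme-to-viseme table. The table entries and helper names
# are hypothetical examples, not taken from the patent.

# Each alignment entry: (phoneme, start_time_sec, end_time_sec)
PhoneAlignment = list[tuple[str, float, float]]

# A toy many-to-one phoneme -> viseme table (real tables cover ~40 phones).
PHONEME_TO_VISEME = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "AA": "open_vowel", "AE": "open_vowel",
    "IY": "spread_vowel", "EY": "spread_vowel",
    "SIL": "neutral",
}

def phones_to_visemes(alignment: PhoneAlignment) -> list[tuple[str, float, float]]:
    """Map each timed phone to its viseme class, merging adjacent duplicates."""
    visemes: list[tuple[str, float, float]] = []
    for phone, start, end in alignment:
        viseme = PHONEME_TO_VISEME.get(phone, "neutral")
        if visemes and visemes[-1][0] == viseme:
            # Extend the previous segment instead of emitting a duplicate viseme.
            visemes[-1] = (viseme, visemes[-1][1], end)
        else:
            visemes.append((viseme, start, end))
    return visemes

if __name__ == "__main__":
    demo = [("SIL", 0.0, 0.1), ("M", 0.1, 0.2), ("AA", 0.2, 0.35), ("SIL", 0.35, 0.5)]
    print(phones_to_visemes(demo))
```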
Researchers have tried various ways of synthesizing visual speech from a given audio signal. In the simplest method, vector quantization is used to divide the acoustic vector space into a number of subspaces (generally equal to the number of phones), and the centroid of each subspace is mapped to a distinct viseme. At synthesis time, the nearest centroid is found for the incoming audio vector and the corresponding viseme is chosen as the output. In F. Lavagetto, Arzarello and M. Caranzano, “Lipreadable Frame Animation Driven by Speech Parameters”, International Symposium on Speech, Image Processing and Neural Networks (ISSIPNN), 1994, the authors use Hidden Markov Models (HMMs) which are trained using both audio and video features as follows. During training, Viterbi alignment is used to obtain the most likely HMM state sequence for a given speech utterance. Then, for each HMM state, all the corresponding image frames are collected and the average of their visual parameters is assigned to that state. At synthesis time, the input speech is aligned to the most likely HMM state sequence using Viterbi decoding. The image parameters corresponding to this state sequence are retrieved, and the resulting visual parameter sequence is animated with proper smoothing.
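As a concrete illustration of the simplest (vector-quantization) method described above, the sketch below assigns each incoming acoustic frame to the viseme of its nearest codebook centroid. The centroid values, feature dimension, and viseme labels are invented for demonstration; they are not the patent's data.

```python
# Illustrative sketch of the vector-quantization approach: the acoustic space is
# partitioned by a set of centroids (one per phone-like class), each centroid is
# associated with a viseme, and every incoming audio frame is mapped to the
# viseme of its nearest centroid. All values here are made up for demonstration.
import numpy as np

rng = np.random.default_rng(0)
FEATURE_DIM = 13                     # e.g. 13 cepstral coefficients per frame
NUM_CLASSES = 4                      # toy codebook size

centroids = rng.normal(size=(NUM_CLASSES, FEATURE_DIM))   # toy codebook
centroid_visemes = ["bilabial", "open_vowel", "spread_vowel", "neutral"]

def frames_to_visemes(frames: np.ndarray) -> list[str]:
    """Assign each acoustic frame (shape [T, FEATURE_DIM]) to the viseme of its
    nearest codebook centroid (Euclidean distance)."""
    # Pairwise distances between frames and centroids: shape [T, NUM_CLASSES]
    dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)
    return [centroid_visemes[k] for k in nearest]

if __name__ == "__main__":
    audio_frames = rng.normal(size=(10, FEATURE_DIM))  # fake 10-frame utterance
    print(frames_to_visemes(audio_frames))
```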
Recently, co-pending patent application Ser. No. 09/384,763 described a novel way of generating visemic alignments from an audio signal which makes use of viseme-based HMMs. In this approach, all the audio vectors corresponding to a given viseme are merged into a single class. This viseme-based audio data is then used to train viseme-based audio HMMs. At synthesis time, the input speech is aligned with the viseme-based HMM state sequence, and the image parameters corresponding to this state sequence are animated with the required smoothing. See also T. Ezzat and T. Poggio, “MikeTalk: A Talking Facial Display Based on Morphing Visemes”, Proceedings of IEEE Computer Animation '98, Philadelphia, PA, June 1998, pp. 96-102.
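The synthesis step common to these HMM-based approaches, retrieving stored visual parameters for an aligned state sequence and smoothing them before animation, can be sketched as follows. The state labels, parameter vectors, and the choice of a simple moving-average filter are assumptions for illustration only, not the patent's implementation.

```python
# Illustrative sketch: once input speech has been aligned to a sequence of
# viseme-based HMM states, the stored average visual parameters for each state
# are retrieved and smoothed before being handed to the animation stage.
# State names, parameter values, and the smoothing filter are invented.
import numpy as np

# Average visual parameters learned per HMM state during training
# (here: 2-D mouth shape parameters with invented values).
STATE_VISUAL_PARAMS = {
    "bilabial_s1":   np.array([0.05, 0.10]),
    "bilabial_s2":   np.array([0.02, 0.08]),
    "open_vowel_s1": np.array([0.60, 0.45]),
    "neutral_s1":    np.array([0.20, 0.20]),
}

def synthesize_visual_track(state_alignment: list[str], window: int = 3) -> np.ndarray:
    """Retrieve per-frame visual parameters for an HMM state alignment and apply
    a moving-average filter so the animated trajectory is smooth."""
    raw = np.stack([STATE_VISUAL_PARAMS[s] for s in state_alignment])  # [T, 2]
    kernel = np.ones(window) / window
    # Smooth each visual parameter independently along the time axis.
    smoothed = np.stack(
        [np.convolve(raw[:, d], kernel, mode="same") for d in range(raw.shape[1])],
        axis=1,
    )
    return smoothed

if __name__ == "__main__":
    alignment = ["neutral_s1", "bilabial_s1", "bilabial_s2", "open_vowel_s1", "neutral_s1"]
    print(synthesize_visual_track(alignment))
```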
All of the above approaches require training a speech recognition system, which is used to generate the alignment of the input speech needed for synthesis. Further, these approaches require a speech recognition system in the language in which the audio is provided in order to obtain the time alignment for the phonetic sequence of the audio signal. However, building a speech recognition system is a very tedious and time-consuming task.
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide a novel scheme to implement a language-independent system for audio-driven facial animation given a speech recognition system for just one language, e.g., English. The same method can also be used for text-to-audiovisual speech synthesis.
The invention is based on the recognition that, once the alignment is generated, the mapping and the animation have hardly any language dependency in them. Translingual visual speech synthesis can therefore be achieved if the first step, alignment generation, can be made language independent. In the following, we propose a method to perform translingual visual speech synthesis; that is, given a speech recognition system for one language (the base language), the invention provides a method of synthesizing video with speech of any other language (the novel language) as the input.
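One way to picture the translingual idea summarized above is to approximate each phone of the novel language by an acoustically similar phone of the base language, let the base-language recognizer produce the time alignment, and then map the result to visemes exactly as in the monolingual case. The sketch below shows only that mapping step; the sample phone table and helper names are hypothetical and not drawn from the patent.

```python
# Illustrative sketch of the cross-language mapping step: phones of the novel
# language are approximated by similar phones of the base language so that a
# base-language recognizer can be reused, and the aligned base phones are then
# mapped to visemes. The toy tables below are hypothetical.

NOVEL_TO_BASE_PHONE = {   # novel-language phone -> closest base-language phone
    "bh": "B",            # e.g. an aspirated bilabial approximated by /b/
    "aa": "AA",
    "ii": "IY",
    "m":  "M",
}

BASE_PHONE_TO_VISEME = {
    "B": "bilabial", "M": "bilabial",
    "AA": "open_vowel", "IY": "spread_vowel",
}

def translingual_viseme_sequence(novel_phones: list[str]) -> list[str]:
    """Map novel-language phones through the base-language phone set to visemes."""
    base_phones = [NOVEL_TO_BASE_PHONE.get(p, "SIL") for p in novel_phones]
    return [BASE_PHONE_TO_VISEME.get(p, "neutral") for p in base_phones]

if __name__ == "__main__":
    print(translingual_viseme_sequence(["m", "aa", "bh", "ii"]))
```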
REFERENCES:
patent: 5608839 (1997-03-01), Chen
patent: 5657426 (1997-08-01), Waters et al.
patent: 5878396 (1999-03-01), Henton
patent: 5995119 (1999-11-01), Cosatto et al.
patent: 6112177 (2000-08-01), Cosatto et al.
patent: 6122616 (2000-09-01), Henton
patent: 6250928 (2001-06-01), Poggio et al.
patent: 6317716 (2001-11-01), Braida et al.
patent: 6366885 (2002-04-01), Basu et al.
patent: 6449595 (2002-09-01), Arslan et al.
patent: 6539354 (2003-03-01), Sutton et al.
patent: 0674315 (1995-09-01), None
patent: 05-298346 (1993-11-01), None
patent: WO99/46732 (1999-09-01), None
R. E. Donovan, et al., “The IBM Trainable Speech Synthesis System”, International Conference on Speech and Language Processing, 1998.
E. D. Petajan, et al., “An Improved Automatic Lipreading System to Enhance Speech Recognition”, Proc. CHI, 1988, pp. 19-25.
T. Chen, et al., “Audio-Visual Integration in Multimodal Communication”, Proceedings of the IEEE, vol. 86, no. 5, May 1998, pp. 837-852.
F. Lavagetto, et al., “Lipreadable Frame Animation Driven by Speech Parameters”, 1994 International Symposium on Speech, Image Processing and Neural Networks (ISSIPNN), Apr. 13-16, 1994, Hong Kong.
Inventors: Tanveer Afzal Faruquie; Chalapathy Neti; Nitendra Rajput; L. Venkata Subramaniam; Ashish Verma
Attorneys/Agents: T. Rao Coca; Michael N. Opsasnick; Whitham Curtis & Christofferson, P.C.
Examiner: Doris H. To