Reexamination Certificate
2000-03-20
2004-01-06
Dorvil, Richemond (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Application
C704S201000
active
06675145
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention is related to the article “Using Speech Acoustics to Drive Facial Motion”, by Hani Yehia, Takaaki Kuratate and Eric Vatikiotis-Bateson (Proceedings of the 14th International Congress of Phonetic Sciences, Vol. 1, pp. 631-634, American Institute of Physics, August 1999), which is attached.
1. Field of the Invention
The present invention is an electronic communication technique. More specifically, it consists of a method and system used for digital encoding-decoding of audiovisual speech, i.e. the facial image and sound produced by a speaker. The signal is encoded at low bit-rates. The speech acoustics are represented in a parametric form, whereas the facial image is estimated from speech acoustic parameters by means of a statistical model.
2. Description of the Background Art
Developments in wide area computer networks and digital communication techniques have contributed to the practical use of video conference systems. These systems enable persons at remote locations to have a conference through a network. Also, telephone communication can be expanded to incorporate video information by means of digital cameras (CCD) currently available. Such systems, however, require bit-rates sufficiently low so that the users' demand is compatible with channel capacity.
Using conventional techniques, transmission of image signals requires a bit-rate between two and three orders of magnitude larger than that required for the transmission of telephone speech acoustics. Thus, if video is to be transmitted over a telephone line, the frame rate has to be very low.
One way to solve this problem is to increase the bit-rate capacity of the channel. Such a solution is, however, expensive and, hence, not practical. Moreover, the increasing demand for real time video communications justifies efforts in the direction of innovative video compression techniques.
Video compression rate is limited if done without taking into account the contents of the image sequence that forms the video signal. In the case of audiovisual speech coding, however, it is known that the image being encoded is that of a human face. The use of this information allows the development of compression techniques which are much more efficient. Furthermore, during speech, the acoustic signal is directly related to the speaker's facial motion. Thus, if the redundancy between audio and video signals is reduced, larger compression rates can be achieved. The technique described in this text goes in this direction.
SUMMARY OF THE INVENTION
The objective of the present invention is to provide a method and system of audiovisual speech coding, which is capable of transmitting and recovering a speaker's facial motion and speech audio with high quality even through a channel of limited capacity.
This objective is achieved in two steps. First, facial images are encoded based on the a priori information that the image being encoded is that of a human face. Second, the dependence between speech acoustics and facial motion is used to allow facial image recovery from the speech audio signal.
In the present invention, the method of transmitting facial image includes the following steps: (1) setup, at the receiver, of a facial shape estimator which receives the speech audio signal as input and generates a facial image of the speaker as output; (2) transmission of the speech audio signal to the receiver, and (3) generation of the facial images which form the speaker's video signal.
Thus, transmission of only the speech audio signal enables the receiver to generate the speaker's facial video. The facial image can then be transmitted with high efficiency, using a channel of far lower bit-rate, as compared with the transmission bit-rate required for standard image coding.
Preferably, the setup step is divided into the following parts: (1.a) specification of an artificial neural network architecture to be used at both transmitter and receiver sides; (1.b) training of the artificial neural network on the transmitting side so that facial images determined from the speech audio signal match original facial images as well as possible; and (1.c) transmission of the weights of the trained artificial neural network to the receiver.
The artificial neural network of the transmitter side is trained and its parameters are sent to the receiver side before communication starts. So, the artificial neural network of the receiving side is set identically to that of the transmitter side when communication is established. Thus it is ready for audiovisual speech communication using only the speech audio to recover the speech video counterpart.
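The weight transfer described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the JSON serialization, the one-hidden-layer tanh architecture, the weight values, and the names `estimate_face`, `transmitter_weights` and `receiver_weights` do not come from the patent text.

```python
import json
import math

def estimate_face(weights, audio_params):
    """Hypothetical estimator: one tanh hidden layer, linear output layer."""
    hidden = [
        math.tanh(sum(w * a for w, a in zip(row, audio_params)) + b)
        for row, b in zip(weights["W1"], weights["b1"])
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights["W2"], weights["b2"])
    ]

# Weights as they might look after training on the transmitting side
# (illustrative values, not taken from the source).
transmitter_weights = {
    "W1": [[0.4, -0.2], [0.1, 0.3], [-0.5, 0.25]],
    "b1": [0.0, 0.1, -0.1],
    "W2": [[0.7, -0.3, 0.2], [0.05, 0.6, -0.4]],
    "b2": [0.01, -0.02],
}

# Sent once, before communication starts.
payload = json.dumps(transmitter_weights).encode("utf-8")

# The receiver rebuilds an identical network from the payload.
receiver_weights = json.loads(payload.decode("utf-8"))

# The same audio parameters now yield the same facial coordinates on both sides.
frame = [0.2, -0.7]
same_output = estimate_face(transmitter_weights, frame) == estimate_face(receiver_weights, frame)
```

Because both sides hold the same weights, only the audio parameters need to travel during the call in this sketch.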
Preferably, the step of neural network training consists of measuring coordinates of predetermined portions of a speaker's face during speech production on the transmitting side; simultaneous extraction of parameters from the speech audio signal; and adjusting the weights of the artificial neural network using the speech audio parameters as input and the facial measured coordinates as reference signal.
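The training step above can be sketched as follows, with speech audio parameters as input and measured facial coordinates as the reference signal. This is a minimal illustration under stated assumptions, not the patented procedure: the network size, the tanh activation, full-batch gradient descent, the learning rate, and the synthetic data are all hypothetical choices, as are the function names.

```python
import math
import random

def init_network(n_in, n_hidden, n_out, seed=0):
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W1": mat(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": mat(n_out, n_hidden), "b2": [0.0] * n_out}

def forward(net, x):
    """Return hidden activations and estimated facial coordinates."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(net["W1"], net["b1"])]
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(net["W2"], net["b2"])]
    return h, y

def mean_squared_error(net, inputs, targets):
    total = 0.0
    for x, t in zip(inputs, targets):
        _, y = forward(net, x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / len(inputs)

def train(net, inputs, targets, lr=0.05, epochs=300):
    """Full-batch gradient descent on the mean squared error."""
    n = len(inputs)
    for _ in range(epochs):
        gW1 = [[0.0] * len(r) for r in net["W1"]]
        gb1 = [0.0] * len(net["b1"])
        gW2 = [[0.0] * len(r) for r in net["W2"]]
        gb2 = [0.0] * len(net["b2"])
        for x, t in zip(inputs, targets):
            h, y = forward(net, x)
            dy = [2.0 * (yi - ti) / n for yi, ti in zip(y, t)]
            for k, d in enumerate(dy):
                gb2[k] += d
                for j, hj in enumerate(h):
                    gW2[k][j] += d * hj
            for j, hj in enumerate(h):
                dh = sum(dy[k] * net["W2"][k][j] for k in range(len(dy)))
                dz = dh * (1.0 - hj * hj)  # derivative of tanh
                gb1[j] += dz
                for i, xi in enumerate(x):
                    gW1[j][i] += dz * xi
        for name, grad in (("W1", gW1), ("W2", gW2)):
            for row, grow in zip(net[name], grad):
                for c in range(len(row)):
                    row[c] -= lr * grow[c]
        for name, grad in (("b1", gb1), ("b2", gb2)):
            for c in range(len(grad)):
                net[name][c] -= lr * grad[c]
    return net

# Synthetic stand-ins: two audio parameters and two facial coordinates per frame.
rng = random.Random(1)
audio_params = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(40)]
face_coords = [[0.5 * a - 0.3 * b, 0.2 * a + 0.4 * b] for a, b in audio_params]

net = init_network(2, 4, 2)
loss_before = mean_squared_error(net, audio_params, face_coords)
train(net, audio_params, face_coords)
loss_after = mean_squared_error(net, audio_params, face_coords)
```

In a real system the inputs would be parameters extracted from recorded speech and the targets would be coordinates measured on the speaker's face at the same instants.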
The artificial neural network is trained for each speaker. Therefore, efficient real time transmission of facial images of an arbitrary speaker is possible.
Preferably, the method of face image transmission also includes the following steps: measuring, for each frame, coordinates of predetermined portions of the speaker's face during speech production; applying the speech audio signal to the trained artificial neural network of the transmitting side to obtain estimated values of the coordinates of the predetermined portions of the speaker's face; and comparing measured and estimated coordinate values to find the estimation error.
Once the error between the coordinate values estimated by the artificial neural network and the actual coordinates of the predetermined positions of the speaker's face is found on the transmitting side, it becomes possible to determine to what extent the face image of the speaker generated on the receiving side through communication matches the speech.
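As a rough illustration of the comparison step, the estimation error for one frame can be taken as the per-coordinate difference between measured and estimated values. The root-mean-square summary and the names `estimation_error` and `error_magnitude` are assumptions for this sketch; the text does not specify how the error is measured.

```python
import math

def estimation_error(measured, estimated):
    """Per-coordinate difference between measured and estimated positions."""
    return [m - e for m, e in zip(measured, estimated)]

def error_magnitude(error):
    """Root-mean-square size of the error vector (an assumed summary measure)."""
    return math.sqrt(sum(e * e for e in error) / len(error))

# One frame: coordinates measured on the face, and the network's estimate.
measured = [10.0, 4.0, 7.5]
estimated = [9.0, 4.0, 9.5]

error = estimation_error(measured, estimated)
rms = error_magnitude(error)
```

A small `rms` value suggests that the face reconstructed at the receiver will follow the speech closely for that frame.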
Preferably, the method of face image transmission further includes the following steps: transmitting the estimation error to the receiving side; and correcting the output of the artificial neural network on the receiving side based on the estimation error received. The precision used to transmit the estimation error is, however, limited by the channel capacity (bit-rate) available.
As the error signal obtained on the transmitting side is transmitted to the receiving side, it becomes possible to correct the image obtained on the receiving side by using the error signal. As a result, a video signal of the speaker's face matching the speech signal can be generated.
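The correction step, with precision limited by the available bit-rate, might look like the sketch below. The uniform quantizer, the step size, and the function names are assumptions chosen for illustration; the text only states that the precision of the transmitted error is limited by channel capacity.

```python
def quantize(error, step):
    """Transmitter side: map each error value to an integer code."""
    return [round(e / step) for e in error]

def dequantize(codes, step):
    """Receiver side: recover an approximate error from the codes."""
    return [c * step for c in codes]

def correct(estimated, codes, step):
    """Receiver side: add the recovered error to the network output."""
    return [e + d for e, d in zip(estimated, dequantize(codes, step))]

measured = [10.0, 4.0, 7.5]    # coordinates measured at the transmitter
estimated = [9.3, 4.1, 9.0]    # output of the (identical) networks
step = 0.5                     # coarser step means fewer bits per value

error = [m - e for m, e in zip(measured, estimated)]
codes = quantize(error, step)
corrected = correct(estimated, codes, step)

# After correction, each coordinate is within half a step of the measurement.
residual = [abs(m - c) for m, c in zip(measured, corrected)]
```

Choosing a larger `step` lowers the bit-rate spent on the error signal at the cost of a larger residual.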
Preferably, the method of face image transmission further includes the following steps: comparing magnitude of the estimation error with a predetermined threshold value; when the magnitude of the error exceeds the threshold value, transmitting the error to the receiving side in response; and correcting the output of the artificial neural network on the receiving side based on the received error.
As the error signal obtained on the transmitting side is transmitted to the receiving side when the magnitude of the error signal obtained on the transmitting side exceeds the predetermined threshold value, it becomes possible to correct the image obtained on the receiving side by using the error signal. As a result, the video signal of the speaker's face matching the speech signal can be obtained. The error signal is not always transmitted, and the bit-rate used to transmit it is chosen so that transmission of the speech signal is not hindered.
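The threshold rule can be sketched as a per-frame decision: the audio parameters are always sent, and the error is attached only when its magnitude exceeds the threshold. The packet layout, the use of the largest absolute coordinate error as the magnitude, and the names `encode_frame` and `decode_frame` are hypothetical.

```python
def encode_frame(audio_params, measured, estimated, threshold):
    """Transmitter side: attach the error only when it is large enough."""
    error = [m - e for m, e in zip(measured, estimated)]
    magnitude = max(abs(e) for e in error)
    packet = {"audio": audio_params}
    if magnitude > threshold:
        packet["error"] = error
    return packet

def decode_frame(packet, estimated):
    """Receiver side: correct the network output if an error was received."""
    error = packet.get("error")
    if error is None:
        return list(estimated)
    return [e + d for e, d in zip(estimated, error)]

threshold = 0.5
measured = [10.0, 4.0]

# Frame with a good estimate: only the audio parameters are sent.
good_estimate = [9.9, 4.0]
quiet_packet = encode_frame([0.2, -0.7], measured, good_estimate, threshold)

# Frame with a poor estimate: the error is sent and used for correction.
poor_estimate = [8.0, 4.5]
loud_packet = encode_frame([0.3, 0.1], measured, poor_estimate, threshold)
recovered = decode_frame(loud_packet, poor_estimate)
```

Here `estimated` stands in for the output of the receiver's own network, which matches the transmitter's because both hold the same weights.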
According to another aspect of the present invention, the system for transmitting the audio signal and the video signal of the face of a speaker during speech production on the transmitting side to the receiving side includes: a transmission apparatus for transmitting the speech audio signal produced by the speaker to the rec
Kuratate Takaaki
Vatikiotis-Bateson Eric
Yehia Hani
Advanced Telecommunications Research Institute International
Armstrong Angela
Dorvil Richemond