Late integration in audio-visual continuous speech recognition

Classification: Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate

Details

C704S249000, C704S244000, C704S275000, C704S266000, C382S115000, C382S118000, C382S170000, C382S209000, C434S185000, C434S169000, C434S167000, C434S310000, C345S440000, C345S441000, C345S442000

Status: active

Patent number: 06633844

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to the use of visual information in speech recognition.
BACKGROUND OF THE INVENTION
In speech recognition, the use of visual information has been of interest because recognition performance can be significantly improved in conditions where audio-only recognition suffers due to a noisy environment. In particular, a main focus of recent developments has been to increase the robustness of speech recognition systems against different types of noise in the audio channel.
In this connection, it has been found that the performance of most, if not all, conventional speech recognition systems suffers a great deal in an uncontrolled environment, which may involve, for example, background noise, poor acoustic channel characteristics, crosstalk and the like. Video can thus play an important role in such contexts, as it provides significant information about the speech that can compensate for noise in the audio channel. Furthermore, it has been observed that a degree of orthogonality exists between the audio and video channels, and this orthogonality can be exploited to improve recognition performance by combining the two channels. The following publications are instructive in this regard: Tsuhan Chen and Ram R. Rao, “Audio-Visual Integration in Multimodal Communication”, Proceedings of the IEEE, vol. 86, May 1998; H. McGurk and J. MacDonald, “Hearing lips and seeing voices”, Nature, pp. 746-748, December 1976; and K. Green, “The use of auditory and visual information in phonetic perception”, in Speechreading by Humans and Machines, D. Stork and M. Hennecke, Eds., Berlin, Germany.
Experiments have also been conducted with various features of audio and visual speech and with different methods of combining the two information channels. One of the earliest audio-visual speech recognition systems was implemented by E. D. Petajan (see E. D. Petajan, “Automatic lipreading to enhance speech recognition”, Proc. IEEE Global Telecommunication Conf., Atlanta, 1984; and E. D. Petajan, B. Bischoff, D. Bodoff and N. M. Brooke, “An improved automatic lipreading system to enhance speech recognition”, Proc. CHI '88, pp. 19-25). In Petajan's experiment, binary images were used to extract mouth parameters such as the height, width and area of the speaker's mouth, and these parameters were then used in the recognition system. The recognition system consisted of an audio speech recognizer followed by a visual speech recognizer, so the visual speech recognizer worked only on a subset of the possible candidates, supplied to it by the audio speech recognizer. Later, the system was modified to use the images themselves instead of the mouth parameters, and the audio-visual integration strategy was changed from the sequential approach to a rule-based approach.
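For illustration only, the following Python sketch shows the general shape of such a sequential (audio-then-visual) cascade: the audio recognizer proposes an N-best list and the visual recognizer decides only among those candidates. The scoring functions, the N-best size and all names here are hypothetical placeholders, not taken from Petajan's system.

    def n_best(score_fn, features, vocabulary, n=5):
        """Return the n highest-scoring candidate words under the given model."""
        ranked = sorted(vocabulary, key=lambda w: score_fn(features, w), reverse=True)
        return ranked[:n]

    def recognize_sequential(audio_feats, visual_feats, vocabulary,
                             audio_score, visual_score, n=5):
        """The audio model prunes the vocabulary; the visual model picks the final word."""
        candidates = n_best(audio_score, audio_feats, vocabulary, n)
        return max(candidates, key=lambda w: visual_score(visual_feats, w))

Any pair of scoring functions with the signature score(features, word) -> number could be plugged in for audio_score and visual_score.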
A. J. Goldschen, in “Continuous automatic speech recognition by lipreading” (Ph.D. dissertation, George Washington University, Washington, September 1993), analyzed a number of features of the binary images such as height, width and perimeter, along with derivatives of these quantities, and used these features as the input to an HMM (Hidden Markov Model)-based visual speech recognition system. Since then, several experiments have been performed by various researchers to improve upon these basic blocks of audio-visual speech recognition (Chen et al., supra, and: Gerasimos Potamianos and Hans Peter Graf, “Discriminative Training of HMM Stream Exponents for Audio-Visual Speech Recognition”, ICASSP '98; Christopher Bregler and Yochai Konig, “‘Eigenlips’ for Robust Speech Recognition”, ICASSP '98; C. Bregler, Stefan Manke, Hermann Hild, Alex Waibel, “Bimodal Sensor Integration on the Example of ‘Speech Reading’”, IEEE International Conference on Neural Networks, 1993; Uwe Meier, Wolfgang Hürst and Paul Duchnowski, “Adaptive Bimodal Sensor Fusion for Automatic Speechreading”, ICASSP '96; C. Bregler, H. Manke, A. Waibel, “Improved Connected Letter Recognition by Lipreading”, ICASSP '93; and Mamoun Alissali, Paul Deleglise and Alexandrina Rogozan, “Asynchronous Integration of Visual Information in an Automatic Speech Recognition System”, ICSLP '96).
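As a rough, hypothetical illustration of this kind of visual feature stream (not Goldschen's actual implementation), the sketch below computes simple geometric measurements from binary mouth images and appends their first-order differences as derivative features; area is used here as a simpler stand-in for perimeter.

    import numpy as np

    def mouth_geometry(binary_frame):
        """Height, width and area of the lip region in a binary (0/1) mouth image."""
        rows = np.any(binary_frame, axis=1)
        cols = np.any(binary_frame, axis=0)
        return np.array([rows.sum(), cols.sum(), binary_frame.sum()], dtype=float)

    def visual_features(frames):
        """Stack the static geometric features with their first-order derivatives."""
        static = np.stack([mouth_geometry(f) for f in frames])
        delta = np.diff(static, axis=0, prepend=static[:1])  # first frame gets a zero delta
        return np.hstack([static, delta])

The resulting per-frame vectors could then serve as the observation sequence for an HMM-based visual recognizer.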
However, challenges are often encountered when there is a need to combine audio and visual streams in an intelligent manner. A general discussion of data fusion may be found in “Mathematical Techniques in Multisensor Data Fusion” (David L. Hall, Artech House, 1992), while early attempts at audio-visual recognition are described in “Audio-Visual Large Vocabulary Continuous Speech Recognition in the Broadcast Domain” (Basu et al., IEEE Workshop on Multimedia Signal Processing, Sep. 13-15, Copenhagen, 1999). A need, however, has been recognized in connection with producing improved results.
Generally speaking, several problems have been recognized in conventional arrangements for combining audio with video for speech recognition. For one, audio and video features have different dynamic ranges. Additionally, audio and video features have different numbers of distinguishable classes; that is, there are typically a different number of phonemes than visemes. Further, due to the complexities involved in articulatory phenomena, there tends to be a time offset between the audio and video signals (see “Eigenlips”, supra). Moreover, video signals tend to be sampled at a slower rate than the audio and therefore need to be interpolated.
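The sampling-rate and dynamic-range issues in particular are commonly addressed by interpolating the video features onto the audio frame times and normalizing each stream. The following is a minimal sketch under assumed frame rates; the 30 Hz and 100 Hz values and all names are illustrative and do not come from the patent.

    import numpy as np

    def upsample_video(video_feats, video_rate=30.0, audio_rate=100.0):
        """Linearly interpolate a (T_video, D) feature matrix onto the audio frame times."""
        t_video = np.arange(len(video_feats)) / video_rate
        t_audio = np.arange(int(t_video[-1] * audio_rate) + 1) / audio_rate
        return np.column_stack([np.interp(t_audio, t_video, video_feats[:, d])
                                for d in range(video_feats.shape[1])])

    def z_normalize(feats):
        """Bring each feature dimension to zero mean and unit variance."""
        return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)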
In view of the problems stated above and others, two different approaches to combining audio and visual information have been tried. In the first approach, termed “early integration” or “feature fusion”, audio and visual features are computed from the acoustic and visual speech, respectively, and are combined prior to recognition. Since the two sets of features correspond to different feature spaces, they may differ in their characteristics as described above, so this approach essentially requires an intelligent way of combining the audio and visual features. Recognition is performed on the combined features, and the output of the recognizer is the final result. This approach has been described in Chen et al., Potamianos et al., “Eigenlips” and Basu et al., supra. However, it has been found that this approach cannot handle different classifications in audio and video, since it uses a common recognizer for both.
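A minimal, hypothetical sketch of such feature fusion follows: the (already rate-matched) audio and video feature matrices are concatenated frame by frame and handed to a single recognizer, which is left here as a placeholder callable.

    import numpy as np

    def fuse_features(audio_feats, video_feats):
        """Concatenate per-frame audio and (already rate-matched) video features."""
        n = min(len(audio_feats), len(video_feats))   # trim to the common length
        return np.hstack([audio_feats[:n], video_feats[:n]])

    def recognize_early(audio_feats, video_feats, recognizer):
        """A single recognizer operates on the joint audio-visual feature space."""
        return recognizer(fuse_features(audio_feats, video_feats))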
In the second approach, termed “late integration” or “decision fusion”, separate recognizers are employed for the audio and visual channels, and the outputs of the two recognizers are then combined to arrive at the final result. The final step of combining the two outputs is essentially the most important step in this approach, since it must account for the orthogonality between the two channels as well as their relative reliability. This approach handles different classifications in the audio and video channels very easily, as the recognizers are separate and the combination takes place at the output level. This approach has been described in “Bimodal Sensor Integration”, Meier et al., “Improved Connected Letter . . . ” and Alissali et al., supra.
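The sketch below shows one common form of decision fusion, a reliability-weighted combination of per-class log-likelihoods from the two recognizers; the weight lam, the dictionary interface and the example values are assumptions for illustration and are not the patent's specific combination rule.

    def fuse_decisions(audio_loglik, video_loglik, lam=0.7):
        """Weighted combination of per-class log-likelihoods from the two channels.

        audio_loglik, video_loglik: dicts mapping class label -> log-likelihood.
        lam: reliability weight given to the audio channel (1 - lam goes to video).
        """
        combined = {c: lam * audio_loglik[c] + (1.0 - lam) * video_loglik[c]
                    for c in audio_loglik if c in video_loglik}
        return max(combined, key=combined.get)

    # e.g. fuse_decisions({"b": -1.2, "p": -0.9}, {"b": -0.5, "p": -2.0}, lam=0.6) -> "b"

In practice the weight would typically be tied to an estimate of how reliable each channel currently is, for instance the acoustic signal-to-noise ratio.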
However, it is to be noted that conventional approaches, whether involving “early” or “late” integration, use a single-phase experiment with a fixed set of phonetic or visemic classes, and the results are not always as favorable as desired. A need has thus been recognized in connection with providing a more effective combination strategy.
SUMMARY OF THE INVENTION
The present invention broadly contemplates methods and apparatus providing innovative strategies for data fusion, particularly multi-phase (such as two-phase) hierarchical combination strategies. Surprising and unexpected results have been observed in connection with the inventive strategies.
In accordance with at least one presently preferred embodiment of the present invention, in particular, the combined likelihood of a phone is determined in two phases. In the first phase, a limited number of viseme-based classes (which will typically
