Type: Reexamination Certificate
Filed: 1999-08-06
Issued: 2003-07-15
Examiner: Abebe, Daniel (Department: 2641)
Classification: Data processing: speech signal processing, linguistics, language – Speech signal processing – Recognition
US Classes: C704S270000, C704S233000
Status: Active
Patent Number: 06594629
CROSS REFERENCE TO RELATED APPLICATIONS
The present application is related to the U.S. patent application entitled: “Methods And Apparatus for Audio-Visual Speaker Recognition and Utterance Verification,” filed concurrently herewith and incorporated by reference herein.
FIELD OF THE INVENTION
The present invention relates generally to speech detection and recognition and, more particularly, to methods and apparatus for using video and audio information to provide improved speech detection and recognition in connection with arbitrary content video.
BACKGROUND OF THE INVENTION
Although significant progress has been made over the last few years in machine transcription using large vocabulary continuous speech recognition (LVCSR), the technology to date is most effective only under controlled conditions such as low noise, speaker-dependent recognition, and read speech (as opposed to conversational speech).
In an attempt to improve speech recognition, it has been proposed to augment the recognition of speech utterances with visual cues. This approach has attracted the attention of researchers over the last couple of years; however, most efforts in this domain remain preliminary in the sense that, unlike LVCSR efforts, tasks have been limited to small vocabularies (e.g., commands, digits) and often to speaker-dependent training or isolated-word speech, where word boundaries are artificially well defined.
The potential for joint audio-visual-based speech recognition is well established on the basis of psycho-physical experiments, e.g., see D. Stork and M. Hennecke, “Speechreading by Humans and Machines,” NATO ASI Series, Series F, Computer and System Sciences, vol. 150, Springer Verlag, 1996; and Q. Summerfield, “Use of visual information for phonetic perception,” Phonetica, vol. 36, pp. 314-331, 1979. Efforts have begun recently on experiments with small vocabulary letter or digit recognition tasks, e.g., see G. Potamianos and H. P. Graf, “Discriminative training of HMM stream exponents for audio-visual speech recognition,” ICASSP, 1998; C. Bregler and Y. Konig, “Eigenlips for robust speech recognition,” ICSLP, vol. II, pp. 669-672, 1994; and R. Stiefelhagen, U. Meier and J. Yang, “Real-time lip-tracking for lipreading,” preprint. In fact, canonical mouth shapes that accompany speech utterances have been categorized, and are known as visual phonemes or “visemes.” Visemes provide information that complements the phonetic stream from the point of view of confusability. For example, “mi” and “ni”, which are confusable acoustically, especially in noisy conditions, are easy to distinguish visually: in “mi” the lips close at onset, whereas in “ni” they do not. As a further example, the unvoiced fricatives “f” and “s”, which are difficult to distinguish acoustically, belong to two different viseme groups. However, the use of visemes has, to date, been limited to small vocabulary recognition, and the video cues have only been derived from controlled, non-arbitrary content video sources.
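By way of illustration, the viseme idea reduces to a many-to-one mapping from phonemes to viseme classes, so that two phonemes are visually distinguishable exactly when they map to different classes. The class names and membership in the Python sketch below are hypothetical examples covering the phonemes discussed above, not a standard viseme inventory:

# Illustrative phoneme-to-viseme grouping; class labels and membership
# are hypothetical examples, not a standard viseme inventory.
PHONEME_TO_VISEME = {
    "m": "bilabial",     # lips close at onset
    "b": "bilabial",
    "p": "bilabial",
    "n": "alveolar",     # lips remain open
    "d": "alveolar",
    "t": "alveolar",
    "f": "labiodental",  # lower lip touches upper teeth
    "v": "labiodental",
    "s": "fricative",    # visually distinct from "f"
    "z": "fricative",
}

def visually_distinguishable(phone_a: str, phone_b: str) -> bool:
    """True if two (possibly acoustically confusable) phonemes belong
    to different viseme classes, i.e., the video cue can separate them."""
    return PHONEME_TO_VISEME[phone_a] != PHONEME_TO_VISEME[phone_b]

# "m" vs. "n": confusable acoustically, distinct visually.
assert visually_distinguishable("m", "n")
# "f" vs. "s": hard to separate acoustically, different viseme groups.
assert visually_distinguishable("f", "s")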
Thus, it would be highly desirable to provide methods and apparatus for employing visual information in conjunction with corresponding audio information to perform improved speech recognition, particularly in the context of arbitrary content video.
Another related problem that has plagued conventional speech recognition systems is the inability of the recognizer to discriminate between extraneous audible activity, e.g., background noise or background speech not intended to be decoded, and speech that is indeed intended to be decoded. Due to this deficiency, a speech recognition system typically attempts to decode any signal picked up by its associated microphone, whether it be background noise, background speakers, etc. One solution has been to employ a microphone with a push-to-talk button: the recognizer begins to decode audio picked up by the microphone only when the speaker pushes the button. However, this approach has obvious limitations. For example, the environment in which the speech recognition system is deployed may not safely permit the user to physically push a button, e.g., a vehicle-mounted speech recognition system. Also, once the speaker pushes the button, any extraneous audible activity can still be picked up by the microphone, causing the recognizer to attempt to decode it. Thus, it would be highly advantageous to provide methods and apparatus for employing visual information in conjunction with corresponding audio information to accurately detect speech intended to be decoded, so that the detection result can serve to automatically turn on or off the decoder and/or a microphone associated with the recognition system.
SUMMARY OF THE INVENTION
The present invention provides various methods and apparatus for using visual information and audio information associated with arbitrary video content to provide improved speech recognition accuracy. Further, the invention provides methods and apparatus for using such visual information and audio information to decide whether or not to decode speech uttered by a speaker.
In a first aspect of the invention, a method of providing speech recognition comprises the steps of processing a video signal associated with an arbitrary content video source, processing an audio signal associated with the video signal, and decoding the processed audio signal in conjunction with the processed video signal to generate a decoded output signal representative of the audio signal. The video signal processing operations may preferably include detecting face candidates in the video signal, detecting facial features associated with the face candidates, and deciding whether at least one of the face candidates is in a frontal pose. Fisher linear discriminant (FLD) analysis and distance from face space (DFFS) measures are preferably employed in accordance with these detection techniques. Also, assuming at least one face is detected, visual speech feature vectors are extracted from the video signal. The audio signal processing operation may preferably include extracting audio feature vectors.
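As a rough sketch of how the frontal-pose decision might be realized with the two measures named above, the fragment below applies an FLD projection and a DFFS reconstruction error to a face candidate; the interface, the thresholds, and the fusion of the two measures into a single boolean test are assumptions for illustration, not the patented procedure:

import numpy as np

def is_frontal(face_pixels: np.ndarray,
               fld_axis: np.ndarray,
               face_space: np.ndarray,
               fld_threshold: float,
               dffs_threshold: float) -> bool:
    """Hypothetical frontal-pose test (assumed interface).

    face_pixels: grayscale face candidate at a fixed size.
    fld_axis: FLD projection vector trained to separate frontal
        from non-frontal faces (assumed given).
    face_space: rows are orthonormal basis vectors (eigenfaces)
        spanning a subspace of frontal faces (assumed given).
    """
    x = face_pixels.ravel().astype(float)
    fld_score = float(fld_axis @ x)
    # DFFS: reconstruction error of x against the frontal face space.
    reconstruction = face_space.T @ (face_space @ x)
    dffs = float(np.linalg.norm(x - reconstruction))
    # High FLD score and low DFFS both favor the "frontal" decision.
    return fld_score > fld_threshold and dffs < dffs_threshold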
In one embodiment, phoneme probabilities are separately computed based on the visual feature vectors and the audio feature vectors. Viseme probabilities may alternatively be computed for the visual information. The probability scores are then combined, preferably using a confidence measure, to form joint probabilities, which are used by a search engine to produce a decoded word sequence representative of the audio signal. In another embodiment, the visual feature vectors and the audio feature vectors are combined such that a single set of probability scores is computed for the combined audio-visual feature vectors. These scores are then used to produce the decoded output word sequence. In yet another embodiment, scores computed based on the information in one path are used to re-score results in the other path. A confidence measure may be used to weight the re-scoring that the other path provides.
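One common realization of the first embodiment's score combination is a log-linear fusion of the per-phoneme (or per-viseme) scores, weighted by a confidence measure; the formula below is an assumed instance of such a combination, since the text does not fix one:

import numpy as np

def joint_log_probs(audio_log_probs: np.ndarray,
                    visual_log_probs: np.ndarray,
                    audio_confidence: float) -> np.ndarray:
    """Convex log-linear fusion of two streams of per-class log scores.

    audio_log_probs, visual_log_probs: shape (num_classes,).
    audio_confidence: weight in [0, 1]; e.g., high for clean audio,
        low in noise so the visual stream compensates.
    """
    lam = float(np.clip(audio_confidence, 0.0, 1.0))
    combined = lam * audio_log_probs + (1.0 - lam) * visual_log_probs
    # Renormalize so the result is again a log-probability distribution.
    return combined - np.logaddexp.reduce(combined)

# Example: noisy audio, so the acoustic stream is down-weighted.
a = np.log(np.array([0.5, 0.3, 0.2]))  # audio phoneme posteriors
v = np.log(np.array([0.1, 0.7, 0.2]))  # visual (viseme-based) posteriors
print(np.exp(joint_log_probs(a, v, audio_confidence=0.3)))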
In a second aspect of the invention, a method of providing speech detection in accordance with a speech recognition system comprises the steps of processing a video signal associated with a video source to detect whether one or more features associated with the video signal are representative of speech, and processing an audio signal associated with the video signal in accordance with the speech recognition system to generate a decoded output signal representative of the audio signal when the one or more features associated with the video signal are representative of speech. In one embodiment, a microphone associated with the speech recognition system is turned on, such that an audio signal from the microphone may be initially buffered, if at least one mouth opening is detected from the video signal. Then, the buffered audio signal is decoded in accordance with the speech recognition system if mouth opening pattern recognition indicates that subsequent portions of the video signal are representative of speech. The decoding operation results in a decoded output word sequence representative of the buffered audio signal.
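The gating logic just described might be sketched as follows; the mouth-opening detector, the mouth-opening pattern recognizer, and the decoder are hypothetical stand-ins passed in as callables:

from collections import deque

def gated_decode(frames, audio_chunks, mouth_open, looks_like_speech, decode):
    """Sketch of video-gated speech decoding (assumed interface).

    frames, audio_chunks: synchronized video frames and audio segments.
    mouth_open(frame): detects a single mouth opening.
    looks_like_speech(history): mouth-opening pattern recognizer over
        recent frames.
    decode(audio): runs the speech recognizer on buffered audio.
    """
    buffer = deque()            # audio held once a mouth opening is seen
    history = deque(maxlen=30)  # recent frames for pattern recognition
    buffering = False
    for frame, chunk in zip(frames, audio_chunks):
        history.append(frame)
        if not buffering and mouth_open(frame):
            buffering = True    # "turn on" the microphone: start buffering
        if buffering:
            buffer.append(chunk)
            if looks_like_speech(history):
                # Subsequent video looks like speech: decode the buffer.
                return decode(list(buffer))
    return None                 # no intended speech detected; decoder stays off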
Inventors: Basu, Sankar; de Cuetos, Philippe Christian; Maes, Stephane Herman; Neti, Chalapathy Venkata; Senior, Andrew William
Examiners: Abebe, Daniel; Dang, Thu Ann
Assignee: International Business Machines Corporation
Agent: Ryan, Mason & Lewis, LLP