Methods and apparatus for audio-visual speaker recognition...

Details

Classification: C704S231000, C704S273000
Type: Reexamination Certificate
Status: Active
Patent number: 06219640

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to speaker recognition and, more particularly, to methods and apparatus for using video and audio information to provide improved speaker recognition and utterance verification in connection with arbitrary content video.
BACKGROUND OF THE INVENTION
Humans identify speakers based on a variety of attributes of the person, including acoustic cues, visual appearance cues and behavioral characteristics (e.g., characteristic gestures and lip movements). In the past, machine implementations of person identification have focused on single techniques relating to audio cues alone (e.g., audio-based speaker recognition), visual cues alone (e.g., face identification, iris identification) or other biometrics. More recently, researchers have attempted to combine multiple modalities for person identification, see, e.g., J. Bigun, B. Duc, F. Smeraldi, S. Fischer and A. Makarov, "Multi-modal person authentication," in H. Wechsler, J. Phillips, V. Bruce, F. Fogelman Soulie and T. Huang (eds.), Face Recognition: From Theory to Applications, Berlin: Springer-Verlag, 1999.
Speaker recognition is an important technology for a variety of applications, including security and, more recently, indexing for search and retrieval of digitized multimedia content (for instance, in the MPEG-7 standard). Audio-based speaker recognition accuracy still needs improvement under acoustically degraded conditions (e.g., background noise) and channel mismatch (e.g., telephone speech), and making such improvements is a difficult problem. As a result, it would be highly advantageous to provide methods and apparatus for improved speaker recognition that perform successfully in the presence of acoustic degradation, channel mismatch, and other conditions that have hampered existing speaker recognition techniques.
SUMMARY OF THE INVENTION
The present invention provides various methods and apparatus for using visual information and audio information associated with arbitrary video content to provide improved speaker recognition accuracy. It is to be understood that speaker recognition may involve user enrollment, user identification (i.e., finding who the person is among the enrolled users), and user verification (i.e., accepting or rejecting an identity claim provided by the user). Further, the invention provides methods and apparatus for using such visual information and audio information to perform utterance verification.
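To make the distinction between identification and verification concrete, here is a minimal sketch in Python (the scorer "score" and the model objects are hypothetical placeholders, not details taken from the patent):

    # Hypothetical illustration: score(model, features) may be any similarity
    # measure, e.g., a Gaussian-mixture-model log-likelihood.

    def identify(enrolled_models, features, score):
        # Identification: choose the enrolled user whose model scores highest.
        return max(enrolled_models,
                   key=lambda user: score(enrolled_models[user], features))

    def verify(claimed_model, features, score, threshold):
        # Verification: accept the identity claim only if the claimed user's
        # model scores above a preset operating threshold.
        return score(claimed_model, features) >= threshold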
In a first aspect of the invention, a method of performing speaker recognition comprises processing a video signal associated with an arbitrary content video source and processing an audio signal associated with the video signal. Then, an identification and/or verification decision is made based on the processed audio signal and the processed video signal. Various decision making embodiments may be employed including, but not limited to, a score combination approach, a feature combination approach, and a re-scoring approach.
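These three strategies can be sketched as follows (an illustrative outline only; the weighted sum and the function names are assumptions, not the patent's specific formulation):

    import numpy as np

    # Score combination: fuse independently computed per-modality scores;
    # a weighted sum is one simple choice.
    def combine_scores(audio_score, video_score, alpha):
        return alpha * audio_score + (1.0 - alpha) * video_score

    # Feature combination: concatenate audio and video feature vectors and
    # score the joint vector with a single classifier.
    def combine_features(audio_features, video_features):
        return np.concatenate([audio_features, video_features])

    # Re-scoring: one modality proposes a candidate short list, and the
    # other modality re-ranks those candidates.
    def rescore(candidates, video_score_fn):
        return max(candidates, key=video_score_fn)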
As will be explained in detail, combining audio-based processing with visual processing for speaker recognition significantly improves accuracy in acoustically degraded conditions such as, for example, the broadcast news domain. The use of two independent sources of information brings significantly increased robustness to speaker recognition, since signal degradations in the two channels are uncorrelated. Furthermore, the use of visual information allows much faster speaker identification than is possible with acoustic information alone. In accordance with the invention, we present results of various methods to fuse person identification based on visual information with identification based on audio information for TV broadcast news video data (e.g., CNN and C-SPAN) provided by the Linguistic Data Consortium (LDC). That is, we provide various techniques to fuse video-based speaker recognition with audio-based speaker recognition to improve performance under mismatch conditions. In a preferred embodiment, we provide a technique to optimally determine the relative weights of the independent decisions based on audio and video so as to achieve the best combination. Experiments on video broadcast news data suggest that significant improvements are achieved by such a combination in acoustically degraded conditions.
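One simple way to realize such a weighting, assuming labeled held-out trials are available, is a grid search over the mixing weight (a sketch only; this stands in for, and is not, the patent's actual optimization):

    import numpy as np

    def best_weight(audio_scores, video_scores, labels):
        # audio_scores, video_scores: arrays of shape (num_trials, num_speakers)
        # labels: the true speaker index for each trial
        def accuracy(alpha):
            fused = alpha * audio_scores + (1.0 - alpha) * video_scores
            return np.mean(fused.argmax(axis=1) == labels)
        # Evaluate a grid of candidate weights and keep the most accurate.
        return max(np.linspace(0.0, 1.0, 101), key=accuracy)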
In a second aspect of the invention, a method of verifying a speech utterance comprises processing a video signal associated with a video source and processing an audio signal associated with the video signal. Then, the processed audio signal is compared with the processed video signal to determine a level of correlation between the signals. This is referred to as unsupervised utterance verification. In a supervised utterance verification embodiment, the processed video signal is compared with a script representing an audio signal associated with the video signal to determine a level of correlation between the signals.
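A minimal sketch of the unsupervised case, assuming two frame-synchronous tracks have already been extracted from the media (an acoustic energy track and a visual lip-opening track; both are illustrative choices, not mandated by the patent):

    import numpy as np

    def utterance_consistency(acoustic_energy, lip_opening):
        # Normalize each 1-D track, then compute their Pearson correlation;
        # a high value suggests the audio and the visible lip motion belong
        # to the same utterance.
        a = (acoustic_energy - acoustic_energy.mean()) / (acoustic_energy.std() + 1e-9)
        v = (lip_opening - lip_opening.mean()) / (lip_opening.std() + 1e-9)
        return float(np.mean(a * v))

A threshold on this correlation then yields the accept/reject decision; in the supervised variant, the visual track would instead be compared against the expected lip motion derived from the script.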
Of course, it is to be appreciated that any one of the above embodiments or processes may be combined with one or more other embodiments or processes to provide even further speaker recognition and utterance verification improvements.
Also, it is to be appreciated that the video and audio signals may be in a compressed format such as, for example, the MPEG-2 standard. The signals may also come from either a live camera/microphone feed or a stored (archival) feed. Further, the video signal may include images at visible and/or non-visible (e.g., infrared or radio frequency) wavelengths. Accordingly, the methodologies of the invention may be performed under poor lighting, changing lighting, or even no-light conditions. Given the inventive teachings provided herein, one of ordinary skill in the art will contemplate various applications of the invention.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

