Reexamination Certificate
2000-04-26
2003-05-20
Knepper, David D. (Department: 2645)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S273000
Reexamination Certificate
active
06567775
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to speech recognition and speaker identification systems and, more particularly, to methods and apparatus for using video and audio information to provide improved speaker identification.
BACKGROUND OF THE INVENTION
Many organizations, such as broadcast news organizations and information retrieval services, must process large amounts of audio information for storage and retrieval purposes. Frequently, the audio information must be classified by subject or speaker name, or both. In order to classify audio information by subject, a speech recognition system initially transcribes the audio information into text for automated classification or indexing. Thereafter, the index can be used to perform query-document matching to return relevant documents to the user. The process of classifying audio information by subject has essentially become fully automated.
The process of classifying audio information by speaker, however, often remains a labor-intensive task, especially for real-time applications, such as broadcast news. While a number of computationally intensive off-line techniques have been proposed for automatically identifying a speaker from an audio source using speaker enrollment information, the speaker classification process is most often performed by a human operator who identifies each speaker change and provides a corresponding speaker identification.
Humans may be identified based on a variety of attributes, including acoustic cues, visual appearance cues, and behavioral characteristics, such as characteristic gestures or lip movements. In the past, machine implementations of person identification have focused on a single technique, relating to audio cues alone (for example, audio-based speaker recognition), visual cues alone (for example, face identification or iris identification), or other biometrics. More recently, however, researchers have attempted to combine multiple modalities for person identification, see, e.g., J. Bigun, B. Duc, F. Smeraldi, S. Fischer and A. Makarov, "Multi-Modal Person Authentication," in H. Wechsler, J. Phillips, V. Bruce, F. Fogelman Soulie, T. Huang (eds.), Face Recognition: From Theory to Applications, Berlin: Springer-Verlag, 1999. U.S. patent application Ser. No. 09/369,706, filed Aug. 6, 1999, entitled "Methods and Apparatus for Audio-Visual Speaker Recognition and Utterance Verification," assigned to the assignee of the present invention, discloses methods and apparatus for using video and audio information to provide improved speaker recognition.
Speaker recognition is an important technology for a variety of applications, including security applications and indexing applications that permit searching and retrieval of digitized multimedia content. Indexing systems, for example, transcribe and index audio information to create content index files and speaker index files. The generated content and speaker indexes can thereafter be utilized to perform query-document matching based on the audio content and the speaker identity. The accuracy of such indexing systems, however, depends in large part on the accuracy of the identified speaker. The accuracy of currently available speaker recognition systems requires further improvement, especially in the presence of acoustically degraded conditions, such as background noise, and channel mismatch conditions. A need therefore exists for a method and apparatus that automatically transcribes audio information and concurrently identifies speakers in real-time using audio and video information. A further need exists for a method and apparatus for providing improved speaker recognition that performs successfully in the presence of acoustic degradation, channel mismatch, and other conditions that have hampered existing speaker recognition techniques. Yet another need exists for a method and apparatus for providing improved speaker recognition that integrates the results of speaker recognition using audio and video information.
SUMMARY OF THE INVENTION
Generally, a method and apparatus are disclosed for identifying the speakers in an audio-video source using both audio and video information. The disclosed audio transcription and speaker classification system includes a speech recognition system, a speaker segmentation system, an audio-based speaker identification system and a video-based speaker identification system. The audio-based speaker identification system identifies one or more potential speakers for a given segment using an enrolled speaker database. The video-based speaker identification system identifies one or more potential speakers for a given segment using a face detector/recognizer and an enrolled face database. An audio-video decision fusion process evaluates the individuals identified by the audio-based and video-based speaker identification systems and determines the speaker of an utterance in accordance with the present invention.
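As a rough illustration of the data flow just described, each segment produces two ranked candidate lists that the fusion step, detailed below, resolves into a single identity. Every class and function name in this sketch is an assumption for exposition, not drawn from the patent:

```python
# Schematic only: the component objects stand in for the audio-based and
# video-based speaker identification systems named above.
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    speaker_id: str   # identity from the enrolled speaker/face database
    rank: int         # 1 = best match within this modality
    score: float      # modality-specific match score

def identify_segment(segment, audio_id, video_id, fuse) -> str:
    """Run both modalities on one segment and fuse their ranked lists."""
    audio_list: List[Candidate] = audio_id.rank_speakers(segment.audio)
    video_list: List[Candidate] = video_id.rank_faces(segment.video)
    return fuse(audio_list, video_list)  # fused decision, detailed below
```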
In one implementation, a linear variation is imposed on the ranked lists produced using the audio and video information by: (1) removing outliers using the Hough transform; and (2) fitting the surviving point set to a line using the least mean squares error method. Thus, the ranked identities output by the audio and video identification systems are reduced to two straight lines defined by:
audioScore = m_1 × rank + b_1; and

videoScore = m_2 × rank + b_2.
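To make the linearization step concrete, the sketch below fits one modality's (rank, score) pairs: a coarse Hough vote over (slope, intercept) space discards outliers, and a least-squares fit on the surviving points yields the line parameters. The grid sizes, tolerance heuristic, and function name are illustrative choices, not taken from the patent:

```python
# A minimal sketch of the ranked-list linearization step, assuming
# numeric (rank, score) pairs for the top candidates of one modality.
import numpy as np

def hough_line_fit(ranks, scores, n_m=64, n_b=64, tol=None):
    """Fit scores ~= m*rank + b after discarding outliers via a coarse
    Hough vote over (slope, intercept) space."""
    ranks = np.asarray(ranks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    rank_span = np.ptp(ranks) + 1e-9
    score_span = np.ptp(scores) + 1e-9

    # Candidate slopes and intercepts spanning (generously) the data extent.
    m_grid = np.linspace(-2 * score_span / rank_span,
                         2 * score_span / rank_span, n_m)
    b_grid = np.linspace(scores.min() - score_span,
                         scores.max() + score_span, n_b)
    if tol is None:
        tol = 0.1 * score_span  # inlier distance threshold (heuristic)

    # Vote: each (m, b) cell counts how many points it explains within tol.
    votes = np.zeros((n_m, n_b), dtype=int)
    for i, m in enumerate(m_grid):
        resid = np.abs(scores[None, :] - (m * ranks[None, :] + b_grid[:, None]))
        votes[i] = (resid <= tol).sum(axis=1)

    # The winning cell gives a rough line; only its inliers survive.
    i, j = np.unravel_index(np.argmax(votes), votes.shape)
    inliers = np.abs(scores - (m_grid[i] * ranks + b_grid[j])) <= tol

    # Least-squares refit on the surviving point set.
    m, b = np.polyfit(ranks[inliers], scores[inliers], 1)
    return m, b, inliers
```

For instance, calling hough_line_fit(range(1, 11), audio_scores) on the top-10 audio candidates would yield the slope m_1 and intercept b_1 used in the fusion step below.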
The decision fusion scheme of the present invention is based on a linear combination of the audio and the video ranked-lists. The line with the higher slope is assumed to convey more discriminative information. The normalized slopes of the two lines are used as the weight of the respective results when combining the scores from the audio-based and video-based speaker analysis.
The weights assigned to the audio and the video scores affect the influence of their respective scores on the ultimate outcome. According to one aspect of the invention, the weights are derived from the data itself. With w_1 and w_2 representing the weights of the audio and the video channels, respectively, the fused score, FS_k, for each speaker is computed as follows:

w_1 = m_1/(m_1 + m_2) and w_2 = m_2/(m_1 + m_2); and

FS_k = w_1(m_1 × rank_k + b_1) + w_2(m_2 × rank_k + b_2),

where rank_k is the rank for speaker k.
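Continuing the illustrative sketch above (the function and variable names remain assumptions), the fused score follows directly from the two fitted lines:

```python
# m1, b1 and m2, b2 would come from hough_line_fit on the audio and
# video ranked lists, respectively.
def fuse_scores(m1, b1, m2, b2, speaker_ranks):
    """Compute the fused score FS_k for each speaker k from the fitted
    audio line (m1, b1), video line (m2, b2), and each speaker's rank."""
    w1 = m1 / (m1 + m2)   # normalized-slope weight of the audio channel
    w2 = m2 / (m1 + m2)   # normalized-slope weight of the video channel
    return {k: w1 * (m1 * r + b1) + w2 * (m2 * r + b2)
            for k, r in speaker_ranks.items()}
```

Presumably the candidate with the best fused score is then reported as the identity of the segment.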
REFERENCES:
patent: 4449189 (1984-05-01), Feix et al.
patent: 5226092 (1993-07-01), Chen
patent: 5659662 (1997-08-01), Wilcox et al.
patent: 6028960 (2000-02-01), Graf et al.
patent: 6460127 (2002-10-01), Akerib
S. Dharanipragada et al., “Experimental Results in Audio Indexing,” Proc. ARPA SLT Workshop, (Feb. 1996), 4 pages.
L. Polymenakos et al., “Transcription of Broadcast News—Some Recent Improvements to IBM's LVCSR System,” Proc. ARPA SLT Workshop, (Feb. 1996), 4 pages.
R. Bakis, “Transcription of Broadcast News Shows with the IBM Large Vocabulary Speech Recognition System,” Proc. ICASSP98, Seattle, WA (1998), 6 pages.
H. Beigi et al., “A Distance Measure Between Collections of Distributions and its Application to Speaker Recognition,” Proc. ICASSP98, Seattle, WA (1998), 4 pages.
S. Chen, “Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion,” Proceedings of the Speech Recognition Workshop (1998), 6 pages.
S. Chen et al., “Clustering via the Bayesian Information Criterion with Applications in Speech Recognition,” Proc. ICASSP98, Seattle, WA (1998), 4 pages.
S. Chen et al., “IBM's LVCSR System for Transcription of Broadcast News Used in the 1997 Hub4 English Evaluation,” Proceedings of the Speech Recognition Workshop (1998), 6 pages.
S. Dharanipragada et al., “A Fast Vocabulary Independent Algorithm for Spotting Words in Speech,” Proc. ICASSP98, Seattle, WA (1998), 4 pages.
J. Navratil et al., “An Efficient Phonotactic-Acoustic system for Language Identification,” Proc. ICASSP98, Seattle, WA (1998), 4 pages.
G. N. Ramaswamy et al., “Compression of Acoustic F
Maali Fereydoun
Viswanathan Mahesh
Dang Thu Ann
International Business Machines Corporation
Knepper David D.
Ryan & Mason & Lewis, LLP