Methods and apparatuses for segmenting an audio-visual...

Image analysis – Pattern recognition – Classification

Reexamination Certificate


Details

C382S159000, C382S173000, C382S190000, C382S209000, C382S225000, C382S227000, C382S276000, C382S305000, C382S197000, C348S480000, C348S484000, C358S403000, C704S243000, C704S239000, C704S245000, C707S793000


active

06404925

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to the field of processing audio-visual recordings for the purpose of automatically indexing the recordings according to content. Specifically, the present invention pertains to the field of finding segments in recorded meetings that correspond to individual oral presentations.
2. Discussion of the Related Art
Conventional approaches were concerned with segmenting audio only, so there was no video channel to exploit. Using uniform-duration windows to provide initial data for speaker clustering has been attempted, but this led to problems with the initial segmentation, since only short windows at arbitrary times could be used for the initial clustering. If the windows were too long, the chance of capturing multiple speakers was high; if too short, they contained insufficient data for good clustering. In the absence of additional cues, windows often overlapped a change in speaker, making them less useful for clustering. Most conventional segmentation work has also been based primarily on audio, for example, meeting segmentation using speech recognition from close-talking lapel microphones.
SUMMARY OF THE INVENTION
Many meetings contain slide presentations by one or more speakers, for example, weekly staff meetings. These meetings are often recorded on audio-visual recording media for future review and reuse. For browsing and retrieval of the contents of such meetings, it is useful to locate the time extent, for example the start and end time, of each individual oral presentation within the recorded meetings.
According to the present invention, automatic image recognition provides cues for audio-based speaker identification to precisely segment individual presentations within video recordings of meetings. Video transform feature vectors are used to identify video frame intervals known to be associated with single-speaker audio intervals. These audio intervals are used to train a speaker recognition system for audio segmentation of the audio-visual recording.
In a presently preferred embodiment, it is assumed that single-speaker oral presentations include intervals where slides are being displayed, and that a particular speaker will speak for the entire duration that each slide is displayed. Single-speaker regions of the audio-visual recording are identified in the video by searching for extended intervals of slide images. Slide intervals are automatically detected, and the audio in these regions is used to train an audio speaker spotting system. A single speaker's presentation may also contain camera shots of the speaker and audience. Since a presentation by a given speaker can span multiple slide intervals, the audio intervals corresponding to the slide intervals are clustered by audio similarity to find the number and order of the speakers giving presentations in the video. After clustering, all audio data from a single speaker is used to train a source-specific speaker model for identifying that speaker from the audio portion of the audio-visual recording. The audio is then segmented using the speaker spotting system, yielding a sequence of single-speaker intervals for indexing against the audio-visual recording.
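The clustering step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: each detected slide interval is represented by a precomputed mean audio feature vector (standing in for real features such as MFCCs), and a hypothetical greedy agglomerative pass groups intervals whose centroids fall within a distance threshold.

```python
# Hypothetical sketch of clustering slide-interval audio by similarity, so
# that all intervals from one presenter can train a single speaker model.
# Feature extraction is stubbed out: each interval is a precomputed mean
# feature vector (an assumption for illustration).
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_intervals(features, threshold):
    """Greedy agglomerative clustering: each interval joins the closest
    existing cluster whose centroid is within `threshold`, else it starts
    a new cluster. Returns one cluster label per interval."""
    centroids, counts, labels = [], [], []
    for f in features:
        best, best_d = None, threshold
        for i, c in enumerate(centroids):
            d = distance(f, c)
            if d < best_d:
                best, best_d = i, d
        if best is None:
            centroids.append(list(f))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            n = counts[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], f)]
            counts[best] += 1
            labels.append(best)
    return labels

# Synthetic mean feature vectors for five slide intervals; intervals
# 0, 1, and 3 mimic one speaker, intervals 2 and 4 another.
intervals = [(0.1, 0.2), (0.12, 0.19), (0.9, 0.8), (0.11, 0.21), (0.88, 0.82)]
print(cluster_intervals(intervals, threshold=0.3))  # → [0, 0, 1, 0, 1]
```

A real system would more likely compare trained per-interval speaker models (e.g. by likelihood distance) rather than raw Euclidean distance between mean vectors; the greedy pass above is only meant to show how multiple slide intervals collapse into per-speaker groups.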
Alternatively, video analyses that search for members of video image classes other than slides, such as single-face detection or detection of a person standing in front of a podium, are used to detect intervals for which the audio comes from a single speaker. In general, any detectable feature in the video, that is, any video image class known to be correlated with single-speaker audio intervals, can be used according to the present invention.
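As a hedged illustration of detecting one such video image class, the sketch below flags slide-like frames with a simple brightness-and-variance heuristic and collapses runs of flagged frames into candidate intervals. The heuristic, the thresholds, and the frame representation (flat lists of grayscale pixel values) are all assumptions for illustration; the patent's approach uses transform-domain feature vectors rather than raw pixel statistics.

```python
# Toy slide-frame detector: slides tend to be bright with low pixel
# variance. Frames are assumed to be flat lists of grayscale values
# (0-255); the thresholds are illustrative, not from the patent.
def looks_like_slide(pixels, min_mean=180, max_var=2500):
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    return mean >= min_mean and var <= max_var

def slide_intervals(frames, min_len=3):
    """Return (start, end) frame-index pairs where at least `min_len`
    consecutive frames are classified as slides, i.e. the 'extended
    intervals' used to locate single-speaker audio."""
    intervals, start = [], None
    for i, frame in enumerate(frames):
        if looks_like_slide(frame):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                intervals.append((start, i - 1))
            start = None
    if start is not None and len(frames) - start >= min_len:
        intervals.append((start, len(frames) - 1))
    return intervals

bright = [200] * 100   # uniform bright frame: classified as a slide
dark = [50] * 100      # dark frame: not a slide
frames = [dark] + [bright] * 4 + [dark] * 2
print(slide_intervals(frames))  # → [(1, 4)]
```

Detecting a different image class (a single face, a person at a podium) would only change the per-frame classifier; the interval-merging step stays the same.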
In an alternative embodiment, face recognition detects frame intervals corresponding to each speaker. In this embodiment, face recognition of a specific speaker associates video intervals of that speaker with audio intervals from that speaker, so face recognition replaces the audio clustering method of the preferred embodiment, which distinguishes audio intervals from different speakers. For example, first and second video image classes corresponding to first and second speakers are used to detect frame intervals corresponding to the first and second speakers, respectively.
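The face-recognition variant can be illustrated by collapsing per-frame recognition labels into per-speaker intervals, which then index the corresponding audio directly, with no clustering step needed. The input format here (one hypothetical speaker label, or None, per frame) is an assumption; any real face recognizer producing per-frame identities would fit.

```python
# Collapse per-frame face-recognition labels into (speaker, start, end)
# intervals. None means no face was recognized in that frame. The labels
# 'A'/'B' below are hypothetical speaker identities for illustration.
def face_intervals(frame_labels):
    intervals = []
    for i, label in enumerate(frame_labels):
        if label is None:
            continue  # unrecognized frame: breaks any running interval
        last = intervals[-1] if intervals else None
        if last is not None and last[0] == label and last[2] == i - 1:
            # extend the current run for this speaker by one frame
            intervals[-1] = (label, last[1], i)
        else:
            intervals.append((label, i, i))
    return intervals

labels = ['A', 'A', None, 'B', 'B', 'A']
print(face_intervals(labels))  # → [('A', 0, 1), ('B', 3, 4), ('A', 5, 5)]
```

Each resulting interval already carries a speaker identity, so the audio under it can train that speaker's model directly.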
According to the present invention, regions of recorded meetings corresponding to individual presentations are automatically found. Once presentations have been located, the region information may be used for indexing and browsing the video. In cases where there is an agenda associated with the meeting, located presentations can be automatically labeled with information obtained from the agenda. This allows presentations to be easily found by presenter and topic.
The methods of the present invention are easily extended across multiple meeting videos and to other domains such as broadcast news. These and other features and advantages of the present invention are more fully described in the Detailed Description of the Invention with reference to the Figures.


REFERENCES:
patent: 5598507 (1997-01-01), Kimber et al.
patent: 5659662 (1997-08-01), Wilcox et al.
patent: 5664227 (1997-09-01), Mauldin et al.
patent: 5806030 (1998-09-01), Junqua
patent: 5872865 (1999-02-01), Normile et al.
patent: 5875425 (1999-02-01), Nakamura et al.
patent: 6009392 (1999-12-01), Kanevsky et al.
patent: 6073096 (2000-06-01), Gao et al.
Nam, et al., "Speaker Identification and Video Analysis for Hierarchical Video Shot Classification," IEEE, Jul. 1997, pp. 550-553.
Boreczky, John S. and Wilcox, Lynn D., "A Hidden Markov Model Framework for Video Segmentation Using Audio and Image Features," FX Palo Alto Laboratory, Palo Alto, CA 94304, USA, in Proc. ICASSP '98, IEEE, May 1998, Seattle, USA.
Foote, Jonathan, "An Overview of Audio Information Retrieval," FX Palo Alto Laboratory, Palo Alto, CA 93404, USA, Dec. 18, 1997.
Foote, Jonathan; Boreczky, John; Girgensohn, Andreas; and Wilcox, Lynn, "An Intelligent Media Browser Using Automatic Multimodal Analysis," in Proc. ACM Multimedia, pp. 375-380, Bristol, UK, Sep. 1998.
Vasconcelos, Nuno and Lippman, Andrew, "A Bayesian Framework for Semantic Content Characterization," in Proc. CVPR '98, Santa Barbara, 1998.
Iyengar, Giridharan and Lippman, Andrew, "Models for Automatic Classification of Video Sequences," in Proc. SPIE, Storage and Retrieval for Image and Video Databases IV, vol. 3312, pp. 216-227.
Wolf, Wayne, "Hidden Markov Model Parsing of Video Programs," in Proc. ICASSP '97, vol. 4, pp. 2609-2612, IEEE, Apr. 1997.
Wilson, Andrew D.; Bobick, Aaron F.; and Cassell, Justine, "Recovering the Temporal Structure of Natural Gesture," in Proc. Second Int. Conf. on Automatic Face and Gesture Recognition, Oct. 1996 (also MIT Media Laboratory Technical Report No. 338).
Arman, Farshid; Hsu, Arding; and Chiu, Ming-Yee, "Image Processing on Encoded Video Sequences," in Multimedia Systems (1994), vol. 1, no. 5, pp. 211-219.
Mohan, Rakesh, "Video Sequence Matching," in Proc. ICASSP '98, IEEE, May 1998, Seattle.
Chang, Shih-Fu; Chen, William; Meng, Horace J.; Sundaram, Hari; and Zhong, Di, "VideoQ: An Automated Content Based Video Search System Using Visual Cues," in Proc. ACM Multimedia, Nov. 1997, Seattle, WA.
Swain, Michael J., "Interactive Indexing into Image Databases," in Proc. SPIE vol. 1908, Storage and Retrieval for Image and Video Databases, pp. 95-103, Feb. 1993.
Faloutsos, C.; Equitz, W.; Flickner, M.; Niblack, W.; Petkovic, D.; and Barber, R., "Efficient and Effective Querying by Image Content," in M. Maybury, ed., Intelligent Multimedia Information Retrieval, AAAI Press/MIT Press, 1997.
Uchihashi, Shingo; Wilcox, Lynn; Chiu, P.; and Cass, T., FXPAL-IP-98-012, "Automatic Index Creation for Handwritten Notes."
Foote, Jonathan T., "Content-Based Retrieval of Music and Audio," in C.-C. J. Kuo et al., Multimedia Storage and Archiving Systems I.
