Motion video signal processing for recording or reproducing – Local trick play processing – With randomly accessible medium
Reexamination Certificate
2000-03-03
2004-02-24
Chevalier, Robert (Department: 2615)
C386S349000
active
06697564
ABSTRACT:
BACKGROUND
1. Technical Field
This disclosure relates to video browsing and editing, and more specifically, to a video browsing and editing method and apparatus, which is supported by audio browsing and labeling capabilities.
2. Description of the Related Art
The content of a story is mostly carried by the audio information, i.e., one must listen to the audio to understand the content. The common visual signature of a person is the face, and it is a difficult task to detect and match faces from all angles. In addition, the face of the speaker may not appear in the video while he or she is talking. Visual information is therefore, in most cases, not enough to tell the story; this is why silent movies were supported by text. Audio is employed in addition to the visual information to enhance our understanding of the video. Continuous speech recognition with natural language processing can play a crucial role in video understanding and organization. Although current speech recognition engines have quite large vocabularies, continuous, speaker- and environment-independent speech recognition is still, for the most part, out of reach (see, e.g., Y. L. Chang, W. Zeng, I. Kamel, and R. Alonso, “Integrated image and speech analysis for content-based video indexing,” in Proc. of the Int'l Conf. on Multimedia Computing and Systems, (Hiroshima, Japan), pp. 306-313, IEEE Computer Society, Jun. 17-21, 1996). Word spotting is a reliable method to extract more information from the speech. Hence, a digital video library system, described by A. Hauptmann and M. Smith in “Text, Speech, and Vision for Video Segmentation: The Informedia™ Project,” applies speech recognition and text processing techniques to obtain the key-words associated with each acoustic “paragraph,” whose boundaries are detected by finding silence periods in the audio track. Each acoustic paragraph is matched to the nearest scene break, allowing the generation of an appropriate video paragraph clip in response to a user request. This work has also shown that combining speech, text and image analysis can provide much more information, thus improving content analysis and abstraction of video compared to using a single medium only.
There exist many image, text, and audio processing supported methods to understand the content of the video; however, video abstraction is still a very difficult problem. In other words, no automatic system can do the job of a human video cataloger. Automatic techniques for video abstraction are important in the sense that they can make the human cataloger's job easier. However, the cataloger needs tools to correct the mistakes of the automatic system and to interpret the content of the audio and image sequences by looking at efficient representations of these data.
The content of a story is mostly carried by speech information, i.e., one must listen to the audio to understand the content of the scene. In addition, music is found to play an important role in expressing the director's intention by the way it is used. Music also has the effect of strengthening the impression and the activity of dramatic movies. Hence, speech and music are two important events in the audio track. In addition, speech detection is a pre-processing step to speech recognition and speaker identification.
Traditionally, silence has been an important event in telecommunications. Silence detection is well known from telephony. Many of the silence detection techniques rely on computing the power of the signal in small audio frames (usually 10 or 20 msec long portions of the signal) and thresholding. However, segmenting audio into speech and music classes is a new research topic. Recently, several methods have been proposed for segregation of audio into classes based on feature extraction, training, and classification.
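The frame-power thresholding approach to silence detection described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the function names, the default 20 msec frame length, and the threshold value are assumptions chosen for the example, and a practical threshold would have to be tuned to the recording.

```python
import math

def frame_power(samples, sample_rate, frame_ms=20):
    """Mean power of each non-overlapping frame of frame_ms milliseconds."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers

def detect_silence(samples, sample_rate, frame_ms=20, threshold=1e-4):
    """One boolean per frame: True where the frame power is below threshold.

    The threshold is a hypothetical value for samples scaled to [-1, 1].
    """
    return [p < threshold for p in frame_power(samples, sample_rate, frame_ms)]

# Synthetic test signal: 0.1 s of a 440 Hz tone followed by 0.1 s of silence.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(800)]
quiet = [0.0] * 800
flags = detect_silence(tone + quiet, sample_rate)
```

With 20 msec frames at 8000 samples per second, each frame holds 160 samples, so the 1600-sample test signal yields ten flags: the tone frames are marked as non-silent and the zero-valued frames as silent.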
Audio analysis for video content extraction and abstraction can be extended in several ways. All of the above audio segmentation methods fail when the audio contains multiple sources.
Speaker-based segmentation of the audio is widely used for non-linear audio access. Hence, the speech segments of the audio can be further divided into speaker segments. Although speaker verification and identification problems have been studied in detail in the past, segmentation of the speech into different speakers is a relatively new topic.
In review, audio-visual communication is often more effective than audio-only or text-only communication. Hence, video, and more particularly, digital video is rapidly becoming integrated into typical computing environments. Abstraction of the video data has become increasingly important. In addition to video abstraction, video indexing and retrieval have become important due to the immense amount of video stored in multimedia databases. However, automatic systems that can analyze the video and then extract reliable content information from it have not yet been provided. A human cataloger is still needed to organize video content.
Therefore, a need exists for a video browsing system and tools to facilitate a cataloger's task of analyzing and labeling video. A further need exists for enhancing the existing video browsers by incorporating audio based information spaces.
SUMMARY OF THE INVENTION
A system for browsing and editing video, in accordance with the present invention, includes a video source for providing a video document including audio information and an audio classifier coupled to the video source. The audio classifier is adapted to classify audio segments of the audio information into a plurality of classes. An audio spectrogram generator is coupled to the video source for generating spectrograms for the audio information to check that the audio segments have been identified correctly by the audio classifier. A browser is coupled to the audio classifier for searching the classified audio segments for editing and browsing the video document.
In alternate embodiments, the system may include a pitch computer coupled to the video source for computing a pitch curve for the audio information to identify speakers from the audio information. The pitch curve may indicate speaker changes relative to the generated spectrograms to identify and check speaker changes. The speaker changes are preferably identified in a speaker change list. The speaker change list may be identified in accordance with an audio label list wherein the audio label list stores audio labels for identifying the classes of audio. The system may include a speaker change browser for browsing and identifying speakers in the audio information. The system may further include a memory device for storing an audio label list, and the audio label list may store audio labels associated with the audio information for identifying the audio segments as one of speech, music and silence. The system may further include a video browser adapted for browsing and editing video frames. The video browser may provide frame indices for the video frames, the video frames being associated with audio labels and spectrograms to reference the audio information with the video frames.
A method for editing and browsing a video, in accordance with the present invention, includes providing a video clip including audio, segmenting and labeling the audio into music, silence and speech classes in real-time, determining pitch for the speech class to identify and check changes in speakers, and browsing the changes in speaker and the audio labels to associate the changes in speaker and the audio labels with frames of the video clip.
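The text above does not say how pitch is determined. As one possible illustration, the sketch below estimates pitch per frame with a plain autocorrelation peak search, which is a common textbook technique and an assumption here, not the method of the invention. The function names, the 60 to 400 Hz search range, and the 50 msec frame length are likewise hypothetical.

```python
import math

def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency in Hz from the autocorrelation peak.

    Returns None when no lag has positive autocorrelation (e.g. silence).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

def pitch_curve(samples, sample_rate, frame_ms=50):
    """Pitch estimate for each non-overlapping frame of the signal."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    return [
        estimate_pitch(samples[s:s + frame_len], sample_rate)
        for s in range(0, len(samples) - frame_len + 1, frame_len)
    ]

# Synthetic check: a pure 200 Hz sine sampled at 8000 Hz.
sample_rate = 8000
sine = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
pitch = estimate_pitch(sine, sample_rate)
curve = pitch_curve(sine + [0.0] * 400, sample_rate)
```

A sustained jump in such a curve between neighboring speech frames is the kind of cue a cataloger could inspect when checking a suspected speaker change; real speech would need a more robust estimator than this sketch.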
In other methods, the step of segmenting and labeling the audio into music, silence and speech classes in real-time may include the step of computing statistical time-domain features, based on a zero crossing rate and a root mean square energy distribution for audio segments. The audio segments may have a length of about 1.2 seconds. The step of segmenting and labeling the audio into music, silence and speech classes in real-time may include the step of classifying audio segments into music or speech classes based on a simi
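The time-domain features named in the preceding paragraph, the zero crossing rate and the root mean square energy computed over segments of about 1.2 seconds, can be sketched as follows. The 20 msec frame length inside each segment, the choice of mean and variance as the statistics, and all function names are assumptions made for illustration; the source does not specify them here.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    """Root mean square energy of a frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def mean_and_variance(values):
    """Population mean and variance of a list of numbers."""
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)

def segment_features(samples, sample_rate, segment_s=1.2, frame_ms=20):
    """Per-segment statistics of frame-level ZCR and RMS energy."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    seg_len = int(round(sample_rate * segment_s))
    features = []
    for s in range(0, len(samples) - seg_len + 1, seg_len):
        segment = samples[s:s + seg_len]
        frames = [segment[i:i + frame_len]
                  for i in range(0, seg_len - frame_len + 1, frame_len)]
        zcr_mean, zcr_var = mean_and_variance([zero_crossing_rate(f) for f in frames])
        rms_mean, rms_var = mean_and_variance([rms_energy(f) for f in frames])
        features.append({"zcr_mean": zcr_mean, "zcr_var": zcr_var,
                         "rms_mean": rms_mean, "rms_var": rms_var})
    return features

# 3 seconds of a constant signal at 8000 Hz gives two full 1.2 s segments.
features = segment_features([1.0] * 24000, 8000)
```

Each resulting feature dictionary could then be passed to a classifier that separates speech from music; that classification step is not shown because the source text is cut off before describing it.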
Liou Shih-Ping
Toklu Candemir
Chevalier Robert
F. Chau & Associates LLP
Paschburg Donald B.
Siemens Corporate Research Inc.