Pulse or digital communications – Bandwidth reduction or expansion
Reexamination Certificate
1998-02-24
2003-05-06
Kelley, Chris (Department: 2613)
Pulse or digital communications
Bandwidth reduction or expansion
C386S349000, C345S215000, C345S158000, C382S103000
Reexamination Certificate
active
06560281
ABSTRACT:
BACKGROUND OF THE INVENTION
This invention pertains to the art of data processing systems, and more particularly to a system for condensing a data frame sequence into a digest comprised of key data frames and frames having desired affordances therein, particularly useful for web page publication.
Researchers have been increasingly interested in the problem of browsing and indexing video sequences. “Annotating” such a work means extracting information from the still or moving images comprising the work, and then portraying this information in some way that facilitates its access. “Segmentation” refers to detecting key frames of interest within a video. “Video summarization” combines both annotation and segmentation. The majority of work to facilitate such browsing and indexing has focused on the detection of key frames and scene breaks in general, unconstrained video databases. To remain applicable to general video sequences, these methods use simple image processing techniques and do not attempt any high-level analysis of the content of the sequence.
A simple example of a video browsing and indexing system comprises summarizing and annotating data. In prior known versions, this is an off-line or on-line process whose outputs are indices of the events and the images corresponding to those events. This information is used to make a summary web page containing an image of each event and its time index. By clicking on an image, the user can go to a new page with more information about that event. For instance, if the stored item is a video recording of an oral presentation with reference slides, the web page may contain a high-resolution image of the slides, so that the user can actually read the words on the slides. By clicking on the time index, the user causes the monitor in his or her office to start playing the video from that particular time.
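Such a summary page can be generated directly from the event list. The following minimal sketch renders one image and one clickable time index per event; the event-dictionary keys 'image', 'detail_url', 'time', and 'play_url' are assumptions made here for illustration, not names from the patent.

```python
from html import escape

def summary_page(events):
    """Render a minimal summary web page: one key image and one
    time index per event. Hypothetical event keys: 'image',
    'detail_url', 'time' (seconds), 'play_url'."""
    rows = []
    for e in events:
        mins, secs = divmod(int(e["time"]), 60)
        rows.append(
            f'<p><a href="{escape(e["detail_url"])}">'
            f'<img src="{escape(e["image"])}" alt="event key frame"></a> '
            f'<a href="{escape(e["play_url"])}">{mins:02d}:{secs:02d}</a></p>'
        )
    return "<html><body>\n" + "\n".join(rows) + "\n</body></html>"
```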
Generally speaking, a particular goal of any such automatic video summarization is to be able to save a small set of frames that contain most of the relevant information in the video sequence. That is, one wants to find where the important changes occur. In the restricted domain of overhead presentations, a number of “changes” can occur in the image sequence. These changes fall into two classes, called “nuisances” and “affordances”.
Nuisance changes are those which have no relevant semantic interpretation. Examples of this are when the speaker occludes the slide with their hand or body or when the speaker moves the slide. These nuisance changes should be ignored in an analysis and summarization of the video sequence.
Affordances, on the other hand, are changes in the video sequence that have a semantic interpretation with respect to the presentation. For example, speakers often write, point, or make repetitive gestures at locations on the slide to which they are referring. Another common action is to cover a portion of the slide and gradually reveal the underlying text. These changes are called “affordances” because one can take advantage of them to acquire more information about the presentation. Automatic recognition of the affordances can provide a rich description of the video and allows the production of annotated key frames, through which users can later access portions of the talk where the speaker gestured at a particular location on the slides. Accordingly, it is an object of the invention to provide an improved and novel approach to annotating a meeting video to generate a condensed version of the entire video sequence limited to only the significant display data. A toy illustration of how such a change might be localized is sketched below.
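As a toy illustration (an assumption of this write-up, not the patent's method), the region where a frame departs from the scene's key frame can be localized as a bounding box. Deciding whether that region is a nuisance occlusion or a deliberate gesture would require further analysis, e.g. of its persistence across frames, which is not shown here.

```python
import numpy as np

def change_bounding_box(key_frame, frame, diff_thresh=25):
    """Locate where `frame` departs from `key_frame` (both 2-D uint8
    grayscale arrays), e.g. a hand or pen over the slide. Returns
    (x0, y0, x1, y1) or None. The threshold is an illustrative guess."""
    changed = np.abs(frame.astype(int) - key_frame.astype(int)) > diff_thresh
    ys, xs = np.nonzero(changed)
    if xs.size == 0:
        return None
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
```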
The two main themes explored in previous work on automatic video summarization can be broadly described as segmentation and analysis. Segmentation focuses on finding scene changes or key frames in the video while analysis focuses on understanding actions or events (typically in a more restricted domain). Both types of analysis are important to providing useful video summarization.
Scene-break detection is a first step towards the automatic annotation of digital video sequences. Generally speaking, scene breaks include cuts, an immediate change from one scene to another; dissolves, a gradual change between two scenes; and fades, a gradual change between one scene and a constant image.
There are two basic types of algorithms for scene-break detection. The first uses image-based methods, such as image differencing and color histogramming. The second comprises feature-based methods, which use image edge pixels. These algorithms typically compute the difference between two consecutive images and, when the difference is larger than a threshold, declare a possible scene break.
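A minimal sketch of the image-based variant, assuming grayscale frames as NumPy arrays (the histogram metric and the 0.5 threshold are illustrative choices, not values from any particular system):

```python
import numpy as np

def histogram_difference(frame_a, frame_b, bins=64):
    """L1 distance between the normalized gray-level histograms of
    two frames (2-D uint8 arrays)."""
    h_a, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    h_b, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    return np.abs(h_a / h_a.sum() - h_b / h_b.sum()).sum()

def detect_scene_breaks(frames, threshold=0.5):
    """Flag a possible scene break wherever consecutive frames differ
    by more than `threshold`."""
    return [i for i in range(1, len(frames))
            if histogram_difference(frames[i - 1], frames[i]) > threshold]
```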
Simple image-based differencing tends to over-segment the video sequence when there is motion present in the scene or when the camera is moving, since many pixels will change their color from frame to frame. Zabih et al., “Video Browsing Using Edges and Motion”, in CVPR, pp. 439-446, 1996, have proposed a feature-based method. They detected the appearance of intensity edges that are distant from edges in the previous frame. A global motion computation is used to handle camera or object motion. Such a method can detect and classify scene breaks that are difficult to detect with image-based methods. However, their motion estimation technique (the correlation method and the Hausdorff distance method) cannot handle multiple moving objects well. This may result in false negatives from the detector. Generally speaking, the image- and feature-based methods are naive approaches that use straightforward measurements of scene changes. Accordingly, a need exists for a system that can recognize and process multiple moving objects while limiting detected scene breaks to an appropriate minimum.
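For comparison, a much-simplified feature-based measure in the spirit of (but not identical to) the edge-distance test described above can be sketched as follows; real implementations such as Zabih et al.'s also compensate for global motion, which this sketch omits:

```python
import numpy as np
from scipy import ndimage

def edge_map(frame, grad_thresh=30.0):
    """Crude intensity-edge map: gradient magnitude above a threshold."""
    gy, gx = np.gradient(frame.astype(float))
    return np.hypot(gx, gy) > grad_thresh

def entering_edge_fraction(prev, curr, radius=3):
    """Fraction of edge pixels in `curr` lying farther than `radius`
    pixels from any edge in `prev`; a high value suggests a scene break."""
    near_prev = ndimage.binary_dilation(edge_map(prev), iterations=radius)
    e_curr = edge_map(curr)
    n = e_curr.sum()
    return float((e_curr & ~near_prev).sum()) / n if n else 0.0
```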
BRIEF SUMMARY OF THE INVENTION
In accordance with the present invention there is provided a method and apparatus for automatically acquiring an annotated, digested description of a technical talk that utilizes video images, such as overhead transparencies, as a part of the talk. A particular advantage of such a system is that it can be used to construct a web page version of the videotaped talk which contains only the more relevant information, such as the speaker's slides and gestures that may be semantically important. The resulting digested talk can be used as a search and index tool for accessing the original audio and video.
More particularly, the subject invention is comprised of a method and apparatus for generating a condensed version of a video sequence suitable for publication as an annotated video on a web page. A video sequence is first recorded and stored as a set of image frames. The image frames are stabilized into a warped sequence of distinct and stationary scene changes, preferably each corresponding to a speaker slide, wherein each scene change is comprised of an associated subset of the image frames. A key frame is generated for each scene change representative of the associated subset. Each key frame is compared with the associated subset for identifying image frames including desired affordances such as semantically significant speaker gestures, e.g., pointing. The condensed version of the video is compiled as an integration of the key frames and the frames with the desired affordance images. Thus, redundant image frames and nuisance variations can be deleted from the original video sequence so that the digest is a much more compact and communicable version of the technical talk. Lastly, the condensed version is annotated to time or an audio track for a useful representation of the talk.
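The overall flow can be summarized structurally as follows. Here `segment`, `make_key_frame`, and `has_affordance` are hypothetical callables standing in for the stabilization/segmentation step, the key-frame construction, and the gesture detector described above; they are not functions named by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class Digest:
    key_frames: list = field(default_factory=list)          # one per scene/slide
    affordance_frames: list = field(default_factory=list)   # (frame, time) pairs

def build_digest(frames, times, segment, make_key_frame, has_affordance):
    """Structural sketch of the condensation pipeline described above."""
    digest = Digest()
    for subset in segment(frames):          # indices of one stable scene
        key = make_key_frame([frames[i] for i in subset])
        digest.key_frames.append(key)
        for i in subset:
            # Keep frames that differ from the key frame in a
            # semantically meaningful way (e.g. pointing gestures).
            if has_affordance(key, frames[i]):
                digest.affordance_frames.append((frames[i], times[i]))
    return digest
```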
In accordance with another aspect of the present invention, the stabilizing step comprises analyzing every two consecutive frames in the video sequence to estimate a global image motion between the consecutive frames. When the estimation produces an error computation in excess of a predetermined limit, the excessive error is interpreted as a scene change and a demarcation between subsets of the image frames.
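One plausible realization of this step is sketched below under stated assumptions: OpenCV's (4.x) ECC alignment with an affine motion model, and a 0.9 correlation cutoff chosen for illustration. The patent does not specify these choices.

```python
import cv2
import numpy as np

def scene_change_boundaries(gray_frames, min_correlation=0.9):
    """Estimate a global affine motion between each consecutive pair of
    grayscale frames; a poor fit (low ECC correlation, or failure to
    converge) is treated as a scene change / subset demarcation."""
    boundaries = []
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-4)
    for i in range(1, len(gray_frames)):
        warp = np.eye(2, 3, dtype=np.float32)
        prev = gray_frames[i - 1].astype(np.float32)
        curr = gray_frames[i].astype(np.float32)
        try:
            cc, warp = cv2.findTransformECC(prev, curr, warp,
                                            cv2.MOTION_AFFINE, criteria,
                                            None, 5)
        except cv2.error:
            cc = 0.0  # no consistent global motion found
        if cc < min_correlation:
            boundaries.append(i)
    return boundaries
```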
In accordance with yet another aspect of the present invention …
Black Michael J.
Ju Xuan
Kimber Donald G.
Minneman Scott
An Shawn S.
Fay Sharpe Fagan Minnich & McKee LLP
Kelley Chris