Reexamination Certificate
2000-05-11
2003-04-01
Dorvil, Richemond (Department: 2654)
Data processing: speech signal processing, linguistics, language
Audio signal bandwidth compression or expansion
C704S200100, C704S231000, C704S210000
Reexamination Certificate
active
06542869
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a method for identifying changes in an audio signal which may include music, speech, or a combination of music and speech. More particularly, the present invention relates to the identification of changes in the audio signal for the purpose of indexing, summarizing, beat tracking, or retrieval.
2. Description of the Related Art
With video signals, frame-to-frame differences provide a useful measure of overall changes or novelty in the video signal content. Frame-to-frame differences can be used for automatic segmentation and key frame extraction, as well as for other purposes.
A similar measure for determining significant changes or novelty points in audio might have a number of useful applications. But computing audio changes or boundaries is significantly more difficult than computing video changes. Straightforward approaches such as measuring spectral differences are typically not useful because they produce too many false alarms, since the typical spectra of speech and music are in constant flux.
A typical approach to audio segmentation is to detect silences. Such a system is disclosed by Arons, B. in "SpeechSkimmer: A System for Interactively Skimming Recorded Speech," ACM Trans. on Computer Human Interaction, 4(1):3-38, March 1997. A silence-detection procedure works best for speech, even though silences in the speech signal may have little or no semantic significance. Much audio, such as popular music or reverberant sources, may contain no silences at all, and silence-based segmentation methods will then fail.
Another approach, termed "Auditory Scene Analysis," tries to detect harmonically and temporally related components of sound. Such an approach is described by A. Bregman in Auditory Scene Analysis: Perceptual Organization of Sound, Bradford Books, 1990. Typically the Auditory Scene Analysis procedure works only in a limited domain, such as a small number of sustained and harmonically pure musical notes. For example, the Bregman approach looks for components in the frequency domain that are harmonically or temporally related. Typically, rules or assumptions are used to define what "related" means, and the rules work well only in a limited domain.
Another approach uses speaker identification to segment audio by characteristics of an individual speaker. Such a system is disclosed by Siu et al., "An Unsupervised Sequential Learning Algorithm for the Segmentation of Speech Waveforms with Multiple Speakers," Proc. ICASSP, vol. 2, pp. 189-192, March 1992. Though a speaker identification method could be used to segment music, it relies on statistical models that must be trained from a corpus of labeled data, or estimated by clustering audio segments.
Another approach to audio segmentation operates using musical beat-tracking. One approach to beat tracking uses correlated energy peaks across sub-bands. See Scheirer, Eric D., "Tempo and Beat Analysis of Acoustic Musical Signals," J. Acoust. Soc. Am., 103(1), pp. 588-601, January 1998. Another approach depends on restrictive assumptions, such as requiring that the music be in 4/4 time and have a bass drum on the downbeat. See Goto, M. and Y. Muraoka, "A Beat Tracking System for Acoustic Signals of Music," in Proc. ACM Multimedia 1994, San Francisco, ACM.
SUMMARY OF THE INVENTION
In accordance with the present invention a method is provided to automatically find points of change in music or audio, by looking at local self-similarity. The method can identify individual note boundaries or natural segment boundaries such as verse/chorus or speech/music transitions, even in the absence of cues such as silence.
The present invention works for any audio source regardless of complexity, does not rely on particular acoustic features such as silence or pitch, and needs no clustering or training.
The method of the present invention can be used in a wide variety of applications, including indexing, beat tracking, and retrieving and summarizing music or audio. The method works with a wide variety of audio sources.
The method in accordance with the present invention finds points of maximum audio change by considering self-similarity of the audio signal. For each time window in the audio signal, a transform, such as a Fast Fourier Transform (FFT), is applied to determine a parameterization vector. The self-similarity of the parameterization vectors within past windows and within future windows is determined, as well as the cross-similarity between past and future windows. A significant point of novelty or change will have a high self-similarity in the past and future, and a low cross-similarity. The extent of the time difference between "past" and "future" can be varied to change the scale of the system so that, for example, individual notes can be found using a short time extent, while longer events, such as musical themes, can be identified by considering windows further into the past or future. The result is a measure of how novel the source audio is at any time.
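A minimal sketch of this comparison, assuming cosine similarity over FFT-magnitude window vectors; the window size, Hann windowing, and `extent` parameter here are illustrative assumptions, not the patent's specific embodiment:

```python
import numpy as np

def novelty_score(signal, win=512, extent=8):
    """Novelty via self- vs. cross-similarity of FFT window vectors.

    For each frame, compare the `extent` past windows with the
    `extent` future windows: high similarity within the past and
    within the future, but low similarity across them, marks a
    point of change.
    """
    # Parameterize each window with an FFT magnitude vector.
    n = len(signal) // win
    frames = signal[:n * win].reshape(n, win)
    feats = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
    # Normalize so the dot product is a cosine similarity.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    feats = feats / np.maximum(norms, 1e-12)
    sim = feats @ feats.T          # full similarity matrix

    score = np.zeros(n)
    for t in range(extent, n - extent):
        past = sim[t - extent:t, t - extent:t]
        future = sim[t:t + extent, t:t + extent]
        cross = sim[t - extent:t, t:t + extent]
        # High self-similarity minus high cross-similarity = novelty.
        score[t] = past.mean() + future.mean() - 2 * cross.mean()
    return score
```

Varying `extent` changes the scale of detected events, as the text describes: a small extent responds to note-level changes, a large one to longer structural changes.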
Instances when the difference between the self-similarity and cross-similarity measures is large will correspond to significant audio changes, and provide good points for use in segmenting or indexing the audio. Periodic peaks in the difference measurement can correspond to periodicity in the music, such as rhythm, so the method in accordance with the present invention can be used for beat-tracking, that is, finding the tempo and location of downbeats in music. Applications of this method include:
Automatic segmentation for audio classification and retrieval.
Audio indexing/browsing: jump to segment points.
Audio summarization: play only the start of significantly new segments.
Audio "gisting": play only the segment that best characterizes the entire work.
Aligning music audio waveforms with MIDI notes for segmentation.
Indexing/browsing audio: link/jump to the next novel segment.
Automatically finding endpoints for audio "smart cut-and-paste."
Aligning audio for non-linear time scale modification ("audio morphing").
Tempo extraction, beat tracking, and alignment.
"Auto DJ" for concatenating music with similar tempos.
Finding time indexes in speech audio for automatic animation of mouth movements.
Analysis for structured audio coding such as MPEG-4.
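As an illustration of the beat-tracking use mentioned above, periodicity in a novelty time series can be estimated from its autocorrelation. This is a sketch under assumptions: the novelty score is given as a per-frame array, and the lag bounds are arbitrary choices, not values from the patent.

```python
import numpy as np

def estimate_period(novelty, min_lag=2, max_lag=64):
    """Estimate the beat period (in frames) as the lag of the
    strongest autocorrelation peak of the novelty score."""
    x = np.asarray(novelty, dtype=float)
    x = x - x.mean()                       # remove DC offset
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]  # lags 0..N-1
    # Skip lag 0 (trivial peak) by searching from min_lag upward.
    return min_lag + int(np.argmax(ac[min_lag:max_lag]))
```

The estimated period in frames, multiplied by the window duration, gives a tempo estimate; peak positions modulo the period then locate the downbeats.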
The method in accordance with the present invention thus produces a time series that is proportional to the novelty of an acoustic source at any instant. High values and peaks correspond to large audio changes, so the novelty score can be thresholded to find instants that can be used as segment boundaries.
REFERENCES:
patent: 5227892 (1993-07-01), Lince
patent: 5598507 (1997-01-01), Kimber et al.
patent: 5655058 (1997-08-01), Balasubramanian et al.
patent: 5659662 (1997-08-01), Wilcox et al.
patent: 5828994 (1998-10-01), Covell et al.
patent: 5918223 (1999-06-01), Blum et al.
patent: 5986199 (1999-11-01), Peevers
patent: 6185527 (2001-02-01), Petkovic et al.
patent: 6370504 (2002-04-01), Zick et al.
ICME 2000. IEEE International Conference on Multimedia and Expo, 2000. Foote, “Automatic audio segmentation using measure of audio novelty”. pp. 452-455 vol. 1, Aug. 2000.*
Applications of Signal Processing to Audio and Acoustics, 1999 IEEE Workshop. Tzanetakis et al., “Multifeature audio segmentation for browsing and annotation”. pp. 103-106. Oct. 1999.*
Kimber, D. and Wilcox, L., "Acoustic Segmentation for Audio Browsers," in Proc. Interface Conference, Sydney, Australia, 1996, 10 pp.
Arons, B., "SpeechSkimmer: A System for Interactively Skimming Recorded Speech," ACM Trans. on Computer Human Interaction, Mar. 1997, vol. 4, No. 1, pp. 3-38. (http://www.media.mit.edu/~barons/tochi97.html).
Bregman, A. S., Auditory Scene Analysis: Perceptual Organization of Sound, Bradford Books, 1990.
Eckmann, J.P. et al., "Recurrence Plots of Dynamical Systems," Europhys. Lett., vol. 4 (9), pp. 973-977, Nov. 1, 1987.
Foote, J., "Content-Based Retrieval of Music and Audio," SPIE, 1997, vol. 3229, pp. 138-147.
Foote, J., "Visualizing Music and Audio Using Self-Similarity," ACM Multimedia '99, Oct. 1999, Orlando, Florida.
Foote, J.T. and Silverman, H.F., “A Mode
Dorvil Richemond
Fliesler Dubb Meyer & Lovejoy LLP
Fuji Xerox Co., Ltd.