Multi-feature speech/music discrimination system

Electrical audio signal processing systems and devices – Voice controlled

Reexamination Certificate

Details

C704S231000, C704S233000

active

06570991

ABSTRACT:

FIELD OF THE INVENTION
The present invention is directed to the analysis of audio signals, and more particularly to a system for discriminating between different types of audio signals on the basis of whether their content is primarily speech or music.
BACKGROUND OF THE INVENTION
There are a variety of situations in which, upon receiving an audio input signal, it is desirable to label the corresponding sound as either speech or music. For example, some signal compression techniques are more suitable for speech signals, whereas other compression techniques may be more appropriate for music. By automatically determining whether an incoming audio signal contains speech or music information, the appropriate compression technique can be applied. Another potential application for such discrimination relates to automatic speech recognition that is performed on a multi-media sound object, such as a film soundtrack. As a preprocessing step in such an application, the segments of sound which contain speech must first be identified, so that irrelevant segments can be filtered out before the speech recognition techniques are employed. In yet another application, it may be desirable to construct radio receivers that are capable of making decisions about the content of input signals from various radio stations, to automatically switch to a station having desired content and/or mute undesired content.
Depending upon the particular application, the design criteria for an acceptable speech/music discriminator may vary. For example, in a multi-media processing system, the sound analysis can be carried out in a non-real-time manner. Consequently, the processing speeds can be relatively slow. In contrast, for a radio receiver application, real-time analysis is highly desirable, and therefore the discriminator must have low operating latency. In addition, to provide a low-cost product that is accepted by consumers, the memory requirements for the discrimination process should be relatively small. Preferably, therefore, a speech/music discriminator having utility in a variety of different applications should meet the following criteria:
Robustness—the discriminator should be able to distinguish speech from music throughout a broad signal domain. Human listeners are readily able to distinguish speech from music without regard to the language, speaker, gender or rate of speech, and independently of the type of music. An acceptable speech/music discriminator should also be able to reliably perform under these varying conditions.
Low latency—the discriminator should be able to label a new audio signal as being either speech or music as quickly as possible, as well as to recognize changes from speech to music, or vice versa, as quickly as possible, to provide utility in situations requiring real-time analysis.
Low memory requirements—to minimize the cost of devices incorporating the discriminator, the amount of information that is required to be stored at any given time should be as low as possible.
High accuracy—to be truly useful, the discriminator should operate with relatively low error rates.
In the analysis of audio signals to distinguish speech from music, there are two major factors to be considered, namely the types of inherent information in the signal that can be analyzed for speech or music characteristics, and the classification technique that is used to discriminate between speech and music based upon such information. Early generation discriminators utilized only one particular item of information, or feature, of a sound signal to distinguish music from speech. For example, U.S. Pat. No. 2,761,897 discloses a system in which rapid drops in the level of an audio signal are measured. If the number of changes per unit time is sufficiently high, the sound is labeled as speech. In this type of system, the classification technique is based upon simple thresholding, i.e., whether the number of rapid changes per unit time is above or below a threshold value. Other examples of speech/music discriminating devices which analyze a single feature of an audio signal are disclosed in U.S. Pat. Nos. 4,441,203; 4,542,525 and 5,375,188.
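By way of illustration only, the single-feature thresholding approach described above can be sketched as follows. This is not the circuit of U.S. Pat. No. 2,761,897; the function names, the per-frame level representation, the `drop_ratio` value and the `min_drops_per_second` threshold are all assumptions chosen for the example.

```python
from typing import Sequence


def count_rapid_drops(levels: Sequence[float], drop_ratio: float = 0.5) -> int:
    """Count frame-to-frame transitions where the level falls sharply.

    A "rapid drop" is assumed here to mean that the current level is below
    drop_ratio times the previous level (an illustrative definition).
    """
    return sum(
        1
        for prev, cur in zip(levels, levels[1:])
        if prev > 0 and cur < drop_ratio * prev
    )


def label_by_drop_rate(
    levels: Sequence[float],
    frames_per_second: float,
    drop_ratio: float = 0.5,
    min_drops_per_second: float = 2.0,  # hypothetical threshold
) -> str:
    """Label a signal as speech if rapid drops occur often enough per second."""
    if not levels or frames_per_second <= 0:
        raise ValueError("need at least one level and a positive frame rate")
    duration_s = len(levels) / frames_per_second
    drops_per_second = count_rapid_drops(levels, drop_ratio) / duration_s
    return "speech" if drops_per_second >= min_drops_per_second else "music"
```

The whole decision rests on one number compared against one threshold, which is why such a discriminator has only a single feature with which to separate the two classes.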
More recently, speech/music discrimination techniques have been developed in which more than one feature of an audio signal is analyzed to distinguish between different types of sounds. For example, one such discrimination technique is disclosed in Saunders, “Real-time Discrimination Of Broadcast Speech/Music,” Proceedings of IEEE ICASSP, 1996, pages 993-996. In this technique, statistical features which are based upon the zero-crossing rate of an audio signal are computed, and form one set of inputs to a classifier. As a second type of input, energy-based features are utilized. The classifier in this case is a multi-variate Gaussian classifier which separates the feature space into two domains, respectively corresponding to speech and music.
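As a rough illustration of the two kinds of input mentioned above, the following sketch computes a raw zero-crossing rate and a short-time energy for each frame of a sampled signal. It does not reproduce the statistical features of the Saunders article; the function names, the non-overlapping framing and the treatment of zero as non-negative are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of the frame."""
    if not frame:
        return 0.0
    return sum(x * x for x in frame) / len(frame)


def frame_features(
    samples: Sequence[float], frame_len: int
) -> List[Tuple[float, float]]:
    """Return (zero-crossing rate, energy) for each full non-overlapping frame."""
    if frame_len < 1:
        raise ValueError("frame_len must be positive")
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        features.append((zero_crossing_rate(frame), short_time_energy(frame)))
    return features
```

Each frame thus yields a point in a two-dimensional feature space, and it is points of this kind that a multi-variate classifier is asked to separate.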
As illustrated by the Saunders article, the accuracy with which an audio signal can be classified as containing either speech or music can be significantly increased by considering multiple features of a sound signal. It is one object of the present invention to provide a speech/music discriminator in which the analysis of an audio signal to classify its sound content is based upon an optimum combination of features for a given environment.
Depending upon the number and type of features that are considered in the analysis of the audio signal, different classification frameworks may exhibit different degrees of accuracy. The primary objective of a multi-variate classifier, which receives multiple types of inputs, is to account for variances between classes of input that can be explained in terms of interactions between the measured features. In essence, every classifier determines a “decision boundary” in the applicable feature space. A maximum a posteriori Gaussian classifier, such as that described in the Saunders article, defines a quadric surface, such as a hyperplane, hypersphere, hyperellipsoid, hyperparaboloid, or the like, between the classes. All data points on one side of this boundary are classified as speech, and all points on the other are considered to be music. This type of classifier may work well in those situations where the data can be readily divided into two distinct clusters, which can be separated by such a simple decision boundary. However, there may be situations in which the dispersion of the data for the different classes is somewhat homogeneous within the feature space. In such a case, the Gaussian decision boundary is not as reliable. Accordingly, it is another object of the present invention to provide a speech/music discriminator having a classifier that permits arbitrarily complex decision boundaries to be employed, and thereby increase the accuracy of the discrimination.
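A minimal sketch of a two-class maximum a posteriori Gaussian classifier is given below to make the notion of a quadric decision boundary concrete. For brevity it assumes a diagonal covariance matrix per class, which is a simplification introduced here and not a detail taken from the Saunders article; the function names and the `models` dictionary layout are likewise hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Model = Tuple[List[float], List[float], float]  # (means, variances, prior)


def fit_gaussian(points: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Estimate per-dimension mean and variance (diagonal covariance assumed)."""
    n = len(points)
    dim = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dim)]
    variances = [
        max(sum((p[d] - means[d]) ** 2 for p in points) / n, 1e-9)
        for d in range(dim)
    ]
    return means, variances


def log_posterior_score(
    x: Sequence[float], means: Sequence[float],
    variances: Sequence[float], prior: float,
) -> float:
    """Log prior plus Gaussian log-likelihood; quadratic in x."""
    score = math.log(prior)
    for xi, m, v in zip(x, means, variances):
        score -= 0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return score


def classify_map(x: Sequence[float], models: Dict[str, Model]) -> str:
    """Return the class label with the highest posterior score."""
    return max(models, key=lambda label: log_posterior_score(x, *models[label]))
```

Because each score is a quadratic function of the feature vector, the set of points at which the speech and music scores are equal is a quadric surface. No choice of means, variances or priors can bend that boundary into a more intricate shape.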
SUMMARY OF THE INVENTION
In accordance with one aspect of the present invention, a set of features is provided which can be selectively employed to distinguish speech content from music in an audio signal. In particular, eight different features of a digital audio signal can be measured to analyze the signal. In addition, higher level information is obtained by calculating the variance of some of these features within a predefined time window. More particularly, certain features differ in value between voiced and unvoiced speech. If both types of speech are captured within the time window, the variance will be relatively high. In contrast, music is likely to be constant within the time window, and therefore will have a lower variance value. The differences in the variance values can therefore be employed to distinguish speech sounds from music. By combining data from some of the base features with data from other features, such as the variance features, significant increases in the discrimination accuracy are obtained.
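The variance-based features described above can be sketched as follows, under the assumption that a base feature has already been measured once per frame. The function name, the sliding one-frame step and the use of population variance are illustrative choices, not details taken from this disclosure.

```python
from typing import List, Sequence


def windowed_variance(values: Sequence[float], window: int) -> List[float]:
    """Population variance of each run of `window` consecutive feature values."""
    if window < 1:
        raise ValueError("window must be positive")
    variances = []
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        mean = sum(chunk) / window
        variances.append(sum((v - mean) ** 2 for v in chunk) / window)
    return variances
```

For a feature that alternates between two values within the window, as might occur when voiced and unvoiced speech are both captured, the variance is comparatively high. For a feature that stays constant within the window the variance is zero.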
In another aspect of the invention, a “nearest-neighbor” type of classifier is used to distinguish speech data samples from music data samples. Unlike the Gaussian classifier, the nearest-neighbor classifier estimates loc
