Reexamination Certificate
2001-11-29
2004-08-31
Dorvil, Richemond (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
For storage or transmission
C704S219000, C704S500000
active
06785645
ABSTRACT:
FIELD OF THE INVENTION
This invention is related, in general, to digital signal processing, and more particularly, to a method and a system of classifying different signal types in multi-mode coding systems.
BACKGROUND OF THE INVENTION
In current multimedia applications such as Internet telephony, audio signals are composed of both speech and music signals. However, designing an optimal universal coding system capable of coding both speech and music signals has proven difficult. One of the difficulties arises from the fact that speech and music are represented by very different signals, which has led to disparate coding technologies for the two signal modes. Typical speech coding technology is dominated by model-based approaches such as Code Excited Linear Prediction (CELP) and sinusoidal coding, while typical music coding technology is dominated by transform coding techniques such as the Modulated Lapped Transform (MLT) used together with perceptual noise masking. Each of these coding systems is optimized for its own signal type. For example, linear prediction-based techniques such as CELP can deliver high quality reproduction for speech signals, but yield unacceptable quality for the reproduction of music signals. Conversely, transform coding-based techniques provide excellent quality reproduction for music signals, but their output degrades significantly for speech signals, especially in low bit-rate regimes.
In order to accommodate audio streams of mixed data types, a multi-mode coder that can accommodate both speech and music signals is desirable. There have been a number of attempts to create such a coder. For example, the Hybrid ACELP/Transform Coding Excitation coder and the Multi-mode Transform Predictive Coder (MTPC) are usable to some extent to code mixed audio signals. However, the effectiveness of such hybrid coding systems depends upon accurate classification of the input speech and music signals to adjust the coding mode of the coder appropriately. Such a functional module is referred to as a speech-and-music classifier (hereafter, “classifier”).
In operation, a classifier is initially set to either a speech mode, or a music mode, depending on historical input statistics. Thereafter, upon receiving a sequence of music and speech signals, the classifier classifies the input signal during a particular interval as music or speech, whereupon the coding system is left in, or switched to, the appropriate mode corresponding to the determination of the classifier. While switching of modes in the coder is necessary and desirable when the need to do so is indicated by the classifier, there are disadvantages to switching too readily. Every instance of switching carries with it the possibility of introducing audible artifacts into the reproduced audio signal, degrading the perceived performance of the coder. Unfortunately, prior classification techniques do not provide an efficient solution for avoiding unnecessary switching.
Most current speech/music classifiers are based on classical pattern recognition techniques, in which feature extraction is followed by classification. Such techniques include those described by Ludovic Tancerel et al., in “Combined Speech and Audio Coding by Discrimination,” page 154, Proc. IEEE Workshop on Speech Coding (September 2000), and by Eric Scheirer et al., in “Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator,” Proc. IEEE Int'l Conference Acoustics, Speech, and Signal Processing, page 1331 (April 1997).
Since speech and music signals are intrinsically different, they present disparate signal features, which in turn may be used to discriminate between music and speech signals. Examples of prior classification frameworks include Gaussian mixture models, Gaussian model classification, and nearest-neighbor classification. These frameworks apply statistical analyses to underlying features of the audio signal, measured over either a long or a short period of time, resulting in separate long-term and short-term features.
Use of either of these feature sets exclusively presents certain difficulties. A method based on analysis of long-term features requires a relatively long measurement period. Even though this will likely yield reasonably accurate classification for a frame, long-term features do not allow the switching point between different modes to be localized precisely in time. On the other hand, a method based on analysis of short-term features may respond rapidly to changes between frames, but its classification of a frame may not be as accurate as a classification based on a larger sampling.
SUMMARY OF THE INVENTION
The present invention provides an accurate and efficient classification method for use in a multi-mode coder that encodes a sequence of speech and music frames. The method classifies the frames and switches the coder into speech or music mode, as appropriate, according to the frame classification. The method is especially advantageous for real-time applications such as teleconferencing, interactive network services, and media streaming. In addition to classifying signals as speech or music, the present invention is also usable for classifying signals into more than two signal types. For example, it can be used to classify a signal as speech, music, mixed speech and music, noise, and so on. Thus, although the examples herein focus on the classification of a signal as either speech or music, the invention is not intended to be limited to the examples.
To efficiently and accurately discriminate between speech and music frames in a mixed audio signal, a set of features is selected and extracted from each received frame. Each feature characterizes an essential property of the signal and presents distinct values for music and speech signals. Some of the selected features are obtained from the signal spectrum in the frequency domain, while others are extracted from the signal in the time domain. Furthermore, some of the selected features use variance values to describe the statistical properties of a group of frames.
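The three kinds of features mentioned above (time-domain, frequency-domain, and variance over a group of frames) can be illustrated with a minimal sketch. The specific features chosen here, zero-crossing rate and spectral centroid, are assumptions made for illustration only; this text does not name the features the invention actually uses. The function names are likewise hypothetical.

```python
import cmath
import math
from statistics import pvariance

# Illustrative features only: the source text does not say which features are used.

def zero_crossing_rate(frame):
    """Time-domain feature: fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def spectral_centroid(frame):
    """Frequency-domain feature: magnitude-weighted mean DFT bin index.

    Uses a naive O(n^2) DFT so the sketch needs only the standard library.
    """
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(abs(acc))
    total = sum(mags)
    return sum(k * m for k, m in enumerate(mags)) / total if total > 0 else 0.0

def feature_variance(frames, feature):
    """Statistical feature: population variance of a per-frame feature over a group of frames."""
    return pvariance([feature(f) for f in frames])
```

For instance, feature_variance(frames, zero_crossing_rate) summarizes how much the zero-crossing rate fluctuates across a group of frames, which is one plausible way to obtain a variance-based feature of the kind described.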
For each of the frames, long-term and short-term features are estimated. The short-term features are used to accurately determine a possible switching time for the coder, while the long-term features are used to accurately classify the frames on a frame-by-frame basis. A predefined switching criterion is applied in determining whether to switch the operation mode of the coder. The criterion is defined, at least in part, to avoid unexpected and unnecessary switching of the coder since, as discussed above, such switching may introduce artifacts that audibly degrade the quality of the reproduced signal.
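One way a switching criterion can suppress unnecessary switching is sketched below. The rule used here, switching only after a run of consecutive frames has been classified into the other mode, is an assumption for illustration; the text above does not spell out the actual criterion. The function name and the min_run parameter are hypothetical.

```python
def apply_switching_criterion(labels, initial_mode, min_run=3):
    """Return the coder mode used for each frame.

    Assumed criterion (not taken from the source): the coder switches only
    after `min_run` consecutive frames are classified into the same mode
    that differs from the current one.
    """
    mode = initial_mode
    candidate = None   # mode the coder might switch to
    run = 0            # consecutive frames supporting the candidate
    modes = []
    for label in labels:
        if label == mode:
            candidate, run = None, 0
        elif label == candidate:
            run += 1
        else:
            candidate, run = label, 1
        if candidate is not None and run >= min_run:
            mode, candidate, run = candidate, None, 0
        modes.append(mode)
    return modes
```

With min_run=3, an isolated "music" frame inside a speech passage leaves the coder in speech mode, so the coder switches less often than the raw per-frame labels change.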
According to an embodiment, the input sequence of music and speech signals is recorded in a look-ahead buffer, which is followed by a feature extractor. The feature extractor extracts a set of long-term and short-term features from each frame in the buffer. The long-term and short-term features are then provided to a classification module. The module first detects a potential switching time according to the short-term features of the current coding frame and the current coding mode of the coder. It then classifies each frame according to the long-term features and determines, according to a predefined switching criterion, whether to switch the operation mode of the coder for the classified frame at the potential switching time.
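The flow described in this embodiment can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the callbacks is_switch_candidate (standing in for the short-term test) and classify_window (standing in for the long-term classification over the look-ahead buffer) are hypothetical, and the switching criterion is reduced to "switch if the long-term label differs from the current mode".

```python
from collections import deque

def code_stream(frames, is_switch_candidate, classify_window, initial_mode, lookahead=2):
    """Return the coder mode chosen for each frame, in input order.

    is_switch_candidate(frame, mode) -> bool   (assumed short-term test)
    classify_window(list_of_frames) -> label   (assumed long-term classifier)
    """
    buffer = deque()          # current coding frame plus up to `lookahead` future frames
    mode = initial_mode
    modes = []

    def process_oldest():
        nonlocal mode
        current = buffer[0]
        # Step 1: short-term features flag a potential switching time.
        if is_switch_candidate(current, mode):
            # Step 2: long-term classification over the buffered frames.
            label = classify_window(list(buffer))
            # Step 3: simplified switching criterion (an assumption).
            if label != mode:
                mode = label
        modes.append(mode)
        buffer.popleft()

    for frame in frames:
        buffer.append(frame)
        if len(buffer) > lookahead:
            process_oldest()
    while buffer:             # flush the remaining frames at end of stream
        process_oldest()
    return modes
```

In a toy run where each "frame" is a single number and values above 0.5 suggest music, an isolated high value surrounded by low values is flagged by the short-term test but rejected by the long-term window, so no switch occurs.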
In one embodiment of the invention, the classification for each frame is accomplished by applying a decision tree method with each decision node evaluating a specific selected feature. By comparing the value of the feature with the threshold defined by the node, the decision is propagated down the tree until all the features are evaluated, and a classification decision is thus made. Such a classified frame is then used, in conjunction with one or more frames following it in most cases, in determining whether to switch the operation mode of t
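The threshold comparison at each decision node, as described in this paragraph, can be sketched as follows. The feature names, thresholds, and tree shape are invented for illustration and are not taken from the patent; only the mechanism (compare a feature with the node's threshold, then descend) follows the text.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Node:
    feature: str                  # name of the feature this node evaluates
    threshold: float              # threshold defined by the node
    below: Union["Node", str]     # subtree or final label if value < threshold
    above: Union["Node", str]     # subtree or final label otherwise

def classify(features, node):
    """Propagate the decision down the tree until a leaf label is reached."""
    while isinstance(node, Node):
        value = features[node.feature]
        node = node.below if value < node.threshold else node.above
    return node

# Hypothetical tree: names and thresholds are placeholders, not patent values.
EXAMPLE_TREE = Node(
    feature="zcr_variance",
    threshold=0.05,
    below=Node("centroid_variance", 2.0, below="music", above="speech"),
    above="speech",
)
```

A call such as classify({"zcr_variance": 0.01, "centroid_variance": 1.0}, EXAMPLE_TREE) walks two nodes before returning a label, while a frame whose first feature exceeds the root threshold is labeled immediately.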
Cuperman Vladimir
Khalil Hosam Adel
Wang Tian
Dorvil Richemond
Harper V. Paul
Leydig, Voit & Mayer, Ltd.
Microsoft Corporation