Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition
Reexamination Certificate
1999-03-12
2003-04-29
Chawan, Vijay B (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S226000, C704S227000, C704S228000, C704S208000, C704S214000
Reexamination Certificate
active
06556967
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates, in general, to data processing and, in particular, to speech signal processing for identifying voice activity.
BACKGROUND OF THE INVENTION
A voice activity detector is useful for discriminating between speech and non-speech (e.g., fax, modem, music, static, dial tones). Such discrimination is useful for detecting speech in a noisy environment, compressing a signal by discarding non-speech, controlling communication devices that only allow one person at a time to speak (i.e., half-duplex mode), and so on.
A voice activity detector may be optimized for accuracy, speed, or some compromise between the two. Accuracy often means maximizing the rate at which speech is identified as speech and minimizing the rate at which non-speech is identified as speech. Speed is how much time it takes a voice activity detector to determine if a signal is speech or non-speech. Accuracy and speed work against each other. The most accurate voice activity detectors are often the slowest because they analyze a large number of features of the signal using computationally complex methods. The fastest voice activity detectors are often the least accurate because they analyze a small number of features of the signal using computationally simple methods. The primary goal of the present invention is accuracy.
Many prior art voice activity detectors only do a good job of distinguishing speech from one type of non-speech using one type of discriminator and do not do as well if a different type of non-speech is present. For example, the variance of the delta spectrum magnitude is an excellent discriminator of speech vs. music but it not a very good discriminator of speech vs. modem signals or speech vs. tones. Blind combination of specific discriminators does not lead to a general solution of speech vs. non-speech. A dimension reduction technique such as principal components reduction may be used when a large number of discriminators are analyzed in an attempt to compress the data according to signal variance. Unfortunately, maximizing variance may not provide good discrimination.
Over the past few years, several voice activity detectors have been in use. The first of these is a simple energy detection method, which detects increases in signal energy in voice grade channels. When the energy exceeds a threshold, a signal is declared to be present. By requiring that the variance of the energy distribution also exceed a threshold, the method may be used to distinguish speech from several types of non-speech.
FIG. 1
is an illustration of a voice activity detection method called the readability method
1
. It is a variation of the energy method. A signal is filtered
2
by a pre-whitening filter. An autocorrelation
3
is performed on the pre-whitened signal. The peak in the autocorrelated signal is then detected
4
. The peak is then determined to be within the expected pitch range
5
(i.e., speech) or not
6
(i.e., non-speech). Speech is declared to be present if a bulge occurs in the correlation function within the expected periodicity range for the pitch excitation function of speech. The readability method is similar to the energy method since detection is based on energy exceeding a threshold. The readability method
1
performs better that the energy method because the readability method
1
exploits the periodicity of speech. However, the readability method does not perform well if there are changes in the gain, or dynamic range, of the signal. Also, the readability method identifies non-speech as speech when non-speech exhibits periodicity in the expected pitch range (i.e., 75 to 400 Hz.). The pre-whitening filter removes un-modulated tones (i.e., non-speech) to prevent such tones from being identified as speech. However, such a filter does not remove other non-speech signals (e.g., modulated tones and FM signals) which may be present in a channel carrying speech. Such non-speech signals and may be falsely identified as speech.
FIG. 2
is an illustration of the NP method
20
which detects voice activity by estimating the signal to noise ratio (SNR) for each frame of the signal. A Fast Fourier Transform (FFT) is performed on the signal and the absolute value of the result is squared
21
. The result of the last step is then filtered to remove un-modulated tones using a pre-whitening filter
22
. The variance in the result of the last step is then determined
23
. The result of the last step is then limited to a band of frequencies in which speech may occur
24
. The power spectrum of each frame is computed and sorted
25
into either high energy components or low energy components. High energy components are assumed to be signal (speech which may include non-speech) or interference (non-speech) while low energy components are assumed to be noise (all non-speech). The highest energy components are discarded. The signal power is then estimated from the remaining high energy components
26
. The noise power is estimated by averaging the low-energy components
27
. The signal power is then divided by the noise power
28
to produce the SNR. The SNR is then compared to a user-definable threshold to determine whether or not the frame of the signal is speech or non-speech. Signal detection in the NP method is based on a power ratio measurement and is, therefore, not sensitive to the gain of the receiver. The fundamental assumption in the NP method is that spectral components of speech are sparse.
FIG. 3
illustrates a voice activity detector method named TALKATIVE
30
which detects speech by estimating the correlation properties of cepstral vectors. The assumption is that non-stationarity (a good discriminator of speech) is reflected in cepstral coefficients. Vectors of cepstral coefficients are computed in a frame of the signal
31
. Squared Euclidean distances between cepstral vectors are computed
32
. The squared Euclidean distances are time averaged
33
within the frame in order to estimate the stationarity of the signal. A large time averaged value indicates speech while a small time averaged value indicates a stationary signal (i.e., non-speech). The time averaged value is compared to a user-definable threshold
34
to determine whether or not the signal is speech or non-speech. The TALKATIVE method performs well for most signals, but does not perform well for music or impulsive signals. Also, considerable temporal smoothing occurs in the TALKATIVE method.
U.S. Pat. No. 4,351,983, entitled “SPEECH DETECTOR WITH VARIABLE THRESHOLD,” discloses a device for and method of detecting speech by adjusting the threshold for determining speech on a frame by frame basis. U.S. Pat. No. 4,351,983 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 4,672,669, entitled “VOICE ACTIVITY DETECTION PROCESS AND MEANS FOR IMPLEMENTING SAID PROCESS,” discloses a device for and method of detecting voice activity by comparing the energy of a signal to a threshold. The signal is determined to be voice if its power is above the threshold. If its power is below the threshold then the rate of change of the spectral parameters is tested. U.S. Pat. No. 4,672,669 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,255,340, entitled “METHOD FOR DETECTING VOICE PRESENCE ON A COMMUNICATION LINE,” discloses a method of detecting voice activity by determining the stationary or non-stationary state of a block of the signal and comparing the result to the results of the last M blocks. U.S. Pat. No. 5,255,340 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,276,765, entitled “VOICE ACTIVITY DETECTION,” discloses a device for and a method of detecting voice activity by performing an autocorrelation on weighted and combined coefficients of the input signal to provide a measure that depends on the power of the signal. The measure is then compared against a variable threshold to determine voice activity. U.S. Pat. No. 5,276,765 is hereby incorpo
Nelson Douglas J.
Smith David C.
Townsend Jeffrey L.
Chawan Vijay B
Morelli Robert D.
The United States of America as represented by the National Secu
LandOfFree
Voice activity detector does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Voice activity detector, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Voice activity detector will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3101016