Morphological pure speech detection using valley percentage

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C704S210000, C704S213000

Reexamination Certificate

active

06205422

ABSTRACT:

TECHNICAL FIELD
The invention relates to human speech detection by a computer, and more specifically relates to detecting pure-speech signals in an audio signal that may contain both pure-speech and mixed-speech or non-speech signals.
BACKGROUND OF THE INVENTION
Sounds typically contain a mix of music, noise, and/or human speech. The ability to detect human speech in sounds has important applications in many fields such as digital audio signal processing, analysis and coding. For example, specialized codecs (compression/decompression algorithms) have been developed for more efficient compression of pure sounds containing either music or speech, but not both. Most digital audio signal applications, therefore, use some form of speech detection prior to application of a specialized codec to achieve more compact representation of an audio signal for storage, retrieval, processing or transmission.
However, accurate detection of human speech by a computer in an audio signal produced by sounds containing a mix of music, noise and speech is not an easy task. Most existing speech detection methods use spectral and statistical analyses of the waveform patterns produced by the audio signal. The challenge is to identify features of the waveform patterns that reliably distinguish the pure-speech signals from the non-speech or mixed-speech signals.
For example, some existing methods of speech detection take advantage of a particular feature known as the zero-crossing rate (ZCR). See J. Saunders, “Real-time Discrimination of Broadcast Speech/Music”, Proc. ICASSP'96, pp. 993-996, 1996. The ZCR feature provides a weighted average of the spectral energy distribution in the waveform. Human speech typically produces audio signals having a high ZCR, whereas other sounds, such as noise or music, do not. However, this feature may not always be reliable, as in the case of the sound of highly percussive music or structured noise, which can produce audio signals that have ZCRs indistinguishable from those of human speech.
Other existing methods employ several features, including the ZCR feature, in conjunction with elaborate statistical feature analysis, in an attempt to improve the accuracy of speech detection. See J. D. Hoyt and H. Wechsler, “Detection of Human Speech in Structured Noise”, Proc. ICASSP'94, Vol. 11, 237-240, 1994; E. Scheirer and M. Slaney, “Construction and Evaluation of A Robust Multifeature Speech/Music Discriminator”, Proc. ICASSP'97, 1997.
While a great deal of research has focused on human speech detection, all of these existing methods fail to satisfy one or more of the following desirable characteristics of a speech detection system for modern multimedia applications: high precision, robustness, short time delay and low complexity.
High precision is desirable in digital audio signal applications because it is important to determine the nearly “exact” time when the speech starts and stops, or the boundaries, accurate to within less than a second. Robustness is desirable so that the speech detection system can process audio signals containing a mixture of sounds including noise, music, song, conversation, commercials, etc., all of which may be sampled at different rates without human intervention. Moreover, most digital audio signal applications are real-time applications. Thus, it is advantageous if the speech detection method employed provides results within a few seconds and with as little complexity as possible, for real-time implementation at a reasonable cost.
SUMMARY OF THE INVENTION
The invention provides an improved method for detecting human speech in an audio signal. The method employs a novel feature of the audio signal, identified as the Valley Percentage (VP) feature, that distinguishes the pure-speech signals from the non-speech and mixed-speech signals more accurately than existing known features. While the method is implemented in software program modules, it can also be implemented in digital hardware logic or in a combination of hardware and software components.
An implementation of the method operates on consecutive audio samples in a stream of samples by viewing a predetermined number of samples through a moving window of time. A Feature Computation component computes the value of the VP at each point in time by measuring the low energy parts of the audio signal (the valley) in comparison to the high energy parts of the audio signal (the mountain) for a particular audio sample relative to the surrounding audio samples in a given window. Intuitively, the VP is like the valley area among mountains. The VP is very useful in detecting pure-speech signals from non-speech or mixed-speech signals, because human speech tends to have a higher VP than other types of sounds such as music or noise.
After the initial window of samples is processed, the window is repositioned at (advanced to) the next consecutive audio sample in the stream. The Feature Computation component repeats the computation of the VP, this time using the next window of audio samples in the stream. The process of repositioning and computation is reiterated until a VP has been computed for each sample in the audio signal. A Decision Processor component classifies the audio samples into pure-speech or non-speech classifications by comparing the computed VP values against a threshold VP value.
In actual practice, human speech usually lasts for at least more than a few continuous seconds in real-world digital audio data. Thus, the accuracy of the speech detection is generally improved by removing those isolated audio samples classified as pure-speech, but whose neighboring samples are classified as non-speech, and vice versa. However, at the same time, it is desirable to preserve the sharp boundary between the speech and non-speech segments.
In the implementation, a Post-Decision Processor component accomplishes the foregoing by applying a filter to the binary speech decision mask (containing a string of “1”s and “0”s) generated by the Decision Processor component. Specifically, the Post-Decision Processor component applies a morphological opening filter followed by a morphological closing filter to the binary decision mask values. The result is the elimination of any isolated pure-speech or non-speech mask values (elimination of the isolated “1”s and “0”s). What remains is the desired speech detection mask identifying the boundaries of the pure-speech and non-speech portions of the audio signal.
Implementations of the method may include other features to improve the accuracy of the speech detection. For example, the speech detection method preferably includes a Pre-Processor component to clean the audio signal by filtering out unwanted noise prior to computing the VP feature. In one implementation, a Pre-Processor component cleans the audio signal by first converting the audio signal to an energy component, and then applying a morphological closing filter to the energy component.
The method implements human speech detection efficiently in audio signals containing a mix of music, speech and noise, regardless of the sampling rate. For superior results, however, a number of parameters governing the window sizes and threshold values may be implemented by the method. Although there are many alternatives to determining these parameters, in one implementation, such as in supervised digital audio signal applications, the parameters are pre-determined by training the application a priori. A training audio sample with a known sampling rate and known speech boundaries is used to fix the optimal values of the parameters. In other implementations, such as implementation in an unsupervised environment, adaptive determination of these parameters is possible.
Further advantages and features of the invention will become apparent in the following detailed description and accompanying drawings.


REFERENCES:
patent: 4063033 (1977-12-01), Harbert et al.
patent: 4281218 (1981-07-01), Chuang et al.
patent: 4628529 (1986-12-01), Borth et al.
patent: 4630304 (1986-12-01), Borth et al.
patent: 4975657 (1990-12-

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Morphological pure speech detection using valley percentage does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Morphological pure speech detection using valley percentage, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Morphological pure speech detection using valley percentage will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2476092

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.