Reexamination Certificate
1999-04-27
2001-11-20
Dorvil, Richemond (Department: 2741)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S233000
active
06321194
ABSTRACT:
BACKGROUND
This invention relates to identifying a presence of a voice in audio signals, for example, in a telephone network.
An audio signal can be any electronic transmission that conveys audio information. In a telephone network, audio signals include tones (for example, dual tone multifrequency (DTMF) tones, dial tones, or busy signals), noise, silence, or speech signals. Voice detection differentiates a speech signal from tones, noise, or silence.
One use for voice detection is in automated calling systems used for telemarketing. In the past, for example, a company trying to sell goods or services typically used several different telemarketing operators. Each operator would call a number and wait for an answer before taking further action such as speaking to the person on the line or hanging up and calling another prospective buyer. In recent years, however, telemarketing has become more efficient because telemarketers now use automatic calling machines that can call many numbers at a time and notify the telemarketer when someone has picked up the receiver and answered the call. To perform this function, the automatic calling machines must detect a presence of human speech on the receiver amid other audio signals before notifying the telemarketer. The detection of human speech in audio signals can be achieved using digital signal processing techniques.
FIG. 1 is a block diagram of a voice detector 10 that detects a presence of a voice in an audio signal. A time varying input signal 12 is received and a coder/decoder (CODEC) 14 may be used for analog-to-digital (A/D) conversion if the input signal is an analog signal; that is, a signal continuous in time. During A/D conversion, the CODEC 14 periodically samples in time the analog signal and outputs a digital signal 16 that includes a sequence of the discrete samples. The CODEC 14 optionally may perform other coding/decoding functions (for example, compression/decompression). If, however, the input signal 12 is digital, then no A/D conversion is needed and the CODEC 14 may be bypassed.
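As a rough illustration only, periodic sampling of a continuous-time signal can be sketched as follows. The function name `sample_signal`, the 440 Hz test tone, and the 8 kHz rate are assumptions chosen for the example and do not come from the patent.

```python
import math

def sample_signal(analog, period, duration):
    """Sample a continuous-time function analog(t) once every `period` seconds."""
    count = int(round(duration / period))
    return [analog(n * period) for n in range(count)]

# Hypothetical input: a 440 Hz tone sampled every 1/8000 s for 10 ms.
tone = lambda t: math.sin(2 * math.pi * 440 * t)
samples = sample_signal(tone, period=1 / 8000, duration=0.01)
```

The resulting list plays the role of the sequence of discrete samples described above.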
In either case, the digital signal 16 is provided to a digital signal processor (DSP) 18 which extracts information from the signal using frequency domain techniques such as Fourier analysis. Such frequency-domain representation of audio signals greatly facilitates analysis of the signal. A memory section 20 coupled to the DSP 18 is used by the DSP for storing and retrieving data and instructions while analyzing the digital audio signal 16.
FIG. 2A shows an example of a human speech audio signal 22 represented as an analog signal that may be input into the voice detector 10 of FIG. 1. Furthermore, FIG. 2B shows a digital signal 24 that corresponds to the input analog signal after it has been processed by the CODEC 14. In FIG. 2B, the analog signal of FIG. 2A has been sampled at a period Γ 26. Voiced sounds, such as those illustrated in region 28 of FIGS. 2A and 2B, generally result in a vibration of the human vocal tract and cause an oscillation in the audio signal. In contrast, unvoiced speech sounds, such as those illustrated in region 30 of FIGS. 2A and 2B, generally result in a broad, turbulent (that is, non-oscillatory), and low amplitude signal. The frequency domain representation of the human speech signal of FIG. 2B, for example, displays both voiced and unvoiced characteristics of human speech that may be used in the voice detector 10 to distinguish the speech signal from other audio signals such as tones, noise, or silence.
FIG. 3 is a flow chart of operation of the voice detector of FIG. 1. The voice detector 10 initially determines if the incoming audio signal 12 is digital in format (step 32). If the audio signal is digital, the voice detector 10 performs a discrete Fourier transform (DFT) analysis on the digitized signal (step 36). If, however, the audio signal is not digital, then the CODEC 14 samples the audio signal at a specified period to obtain a digital representation 16 of the audio signal (step 34). Then the voice detector 10 performs a DFT at step 36.
Parameters, such as frequency-domain maxima, are extracted from the signal (step 38) and are compared to predetermined thresholds (step 40). If the parameters exceed the thresholds, the voice detector 10 determines that the audio signal corresponds to a human voice, in which case the voice detector 10 reports the presence of the voice in the audio signal (step 42).
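The DFT, parameter extraction, and threshold comparison of steps 36 through 40 can be sketched as below. This is a minimal sketch under stated assumptions, not the detector's actual implementation: the naive DFT, the local-maximum rule, the single magnitude threshold, and all function names are hypothetical.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitudes of DFT bins 0..N/2, computed directly (O(N^2))."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        mags.append(abs(acc))
    return mags

def spectral_peaks(mags):
    """Indices of bins strictly larger than both neighbouring bins."""
    return [k for k in range(1, len(mags) - 1)
            if mags[k] > mags[k - 1] and mags[k] > mags[k + 1]]

def exceeds_threshold(mags, threshold):
    """Assumed rule: at least one spectral peak is above the threshold."""
    return any(mags[k] > threshold for k in spectral_peaks(mags))

# Hypothetical input: 64 samples of a sinusoid centred on bin 5.
signal = [math.sin(2 * math.pi * 5 * i / 64) for i in range(64)]
mags = dft_magnitudes(signal)
```

In practice a fast Fourier transform would replace the direct sum; the direct form is used here only to keep the example self-contained.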
In step 38, the parameters extracted from the audio signal, such as the frequency-domain maxima, may, for example, correspond to formant frequencies in speech signals. Formants are natural frequencies or resonances of the human vocal tract that occur because of the tubular shape of the tract. There are three main resonances (formants) of significance in human speech, the locations of which are identified by the voice detector 10 and used in the voice detection analysis. Other parameters may be extracted and used by the voice detector 10.
Voice detection analysis is complicated by the fact that formant frequencies are sometimes difficult to identify for low-level voiced sounds. Moreover, defining the formants for unvoiced regions (for example, region 30 in FIGS. 2A and 2B) is impossible.
SUMMARY
Implementations of the invention may include various combinations of the following features.
In one general aspect, a method of detecting a presence of a voice in an audio signal comprises sampling frequency components of the audio signal during a window that starts when a power of the audio signal reaches a predetermined threshold and stops when the audio signal's power drops below the predetermined threshold. The method further comprises generating an array of elements based on the sampled frequency components, each element of the array corresponding to a time-based sum of frequency components. The method makes a voice detection determination based on one or more values calculated from the generated array. Each value corresponds either to a frequency-based sum of array elements or to the window.
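The windowing and array-generation steps of this aspect might be sketched as follows. The representation of the signal as per-frame lists of spectral magnitudes, the per-frame power values, and the names `find_window` and `time_summed_spectrum` are assumptions made for illustration.

```python
def find_window(frame_powers, threshold):
    """Return (start, stop) frame indices of the first power-gated window.

    The window starts at the first frame whose power reaches the threshold
    and stops at the first later frame whose power drops below it.
    Returns None if the threshold is never reached.
    """
    start = None
    for i, power in enumerate(frame_powers):
        if start is None:
            if power >= threshold:
                start = i
        elif power < threshold:
            return start, i
    return None if start is None else (start, len(frame_powers))

def time_summed_spectrum(frames, window):
    """One element per frequency bin: the sum of that bin over the window."""
    start, stop = window
    n_bins = len(frames[0])
    return [sum(frames[t][k] for t in range(start, stop))
            for k in range(n_bins)]

# Hypothetical data: four frames, three frequency bins each.
frames = [[0, 0, 0], [1, 2, 3], [2, 2, 2], [0, 0, 1]]
powers = [0.0, 14.0, 12.0, 1.0]
window = find_window(powers, threshold=5.0)
array = time_summed_spectrum(frames, window)
```

Here the window length (`stop - start`) is one candidate for a value that corresponds to the window itself, and sums over slices of `array` are candidates for the frequency-based sums.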
Embodiments may include one or more of the following features.
A value corresponding to a frequency-based sum of array elements may be a ratio of a frequency-based sum of array elements in a lower frequency range and a frequency-based sum of array elements in a higher frequency range. A value corresponding to a frequency-based sum of array elements may be a ratio of a maximum-value array element in a lower frequency range and a frequency-based sum of array elements in the lower frequency range other than the maximum-value element.
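The two ratios can be illustrated with a short sketch. The `split` index that separates the lower and higher frequency ranges, the function names, and the handling of a zero denominator are assumptions; the source does not specify them.

```python
def band_ratio(array, split):
    """Sum of the lower-range elements divided by the sum of the higher-range ones."""
    low = sum(array[:split])
    high = sum(array[split:])
    return low / high if high else float("inf")

def peak_ratio(array, split):
    """Largest lower-range element divided by the sum of the other lower-range elements."""
    low = array[:split]
    peak = max(low)
    rest = sum(low) - peak
    return peak / rest if rest else float("inf")

# Hypothetical time-summed array with five frequency bins; bins 0-2 are "low".
array_example = [3.0, 9.0, 6.0, 1.0, 2.0]
low_high = band_ratio(array_example, split=3)   # 18.0 / 3.0
peakiness = peak_ratio(array_example, split=3)  # 9.0 / (3.0 + 6.0)
```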
Prior to sampling, the power of the audio signal may be estimated.
The determining may comprise analyzing the calculated values using fuzzy logic, in which analyzing comprises generating a degree of membership in a fuzzy set for each value. The degree of membership, which may be based on a statistical analysis of audio signals, may represent a measure of a likelihood that the audio signal is a voice. The analyzing may comprise combining degrees of membership for each value into a final value and converting the final value into a voice detection decision. The final value may be converted into a decision by comparing the final value to a predetermined threshold.
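One possible shape for the fuzzy analysis is sketched below. The piecewise-linear membership function, the use of a mean to combine degrees of membership, the numeric ranges, and the 0.5 decision threshold are all assumptions for illustration; the source states only that memberships are generated, combined into a final value, and compared to a predetermined threshold.

```python
def membership(value, low, high):
    """Degree of membership in [0, 1], rising linearly from `low` to `high`."""
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)

def detect_voice(values, ranges, decision_threshold=0.5):
    """Combine per-value memberships (here, by averaging) and threshold the result."""
    degrees = [membership(v, lo, hi) for v, (lo, hi) in zip(values, ranges)]
    final = sum(degrees) / len(degrees)
    return final, final > decision_threshold

# Hypothetical calculated values and membership ranges.
final_value, is_voice = detect_voice(
    values=[8.0, 1.25],
    ranges=[(2.0, 10.0), (0.5, 1.5)],
)
```

In a deployed detector the ranges would presumably be derived from a statistical analysis of recorded audio signals, as the paragraph above suggests, rather than fixed by hand.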
The audio signals may occur on a telephone line. Likewise, the audio signals may occur in a computer telephony line.
The methods, techniques, and systems described here may provide one or more of the following advantages. The voice detector is implemented using digital signal processing (DSP) and fuzzy analysis techniques to determine the presence of a voice in an audio signal. The voice detector provides higher reliability and greater simplicity since features are extracted from the averaged spectrum of the incoming signal and fuzzy (as opposed to boolean) logic is employed in the voice detection decision. Furthermore, the voice detector is adaptable since fuzzy logic parameters may be adjusted for different telephone calling location
Brooktrout Technology, Inc.
Dorvil Richemond
Fish & Richardson P.C.