Reexamination Certificate
1997-10-20
2001-04-10
Voeltz, Emanuel Todd (Department: 2761)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
active
06216103
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates generally to electronic speech recognition systems and relates more particularly to a method for implementing a speech recognition system for use during conditions with background noise.
2. Description of the Background Art
Implementing an effective and efficient method for system users to interface with electronic devices is a significant consideration of system designers and manufacturers. Human speech recognition is one promising technique that allows a system user to effectively communicate with selected electronic devices, such as digital computer systems. Speech typically consists of one or more spoken utterances which each may include a single word or a series of closely-spaced words forming a phrase or a sentence. In practice, speech recognition systems typically determine the endpoints (the beginning and ending points) of a spoken utterance to accurately identify the specific sound data intended for analysis. Conditions with significant ambient background-noise levels present additional difficulties when implementing a speech recognition system. Examples of such conditions may include speech recognition in automobiles or in certain manufacturing facilities. In such user applications, in order to accurately analyze a particular utterance, a speech recognition system may be required to selectively differentiate between a spoken utterance and the ambient background noise.
Referring now to FIG. 1, a diagram of speech energy 110 from an exemplary spoken utterance is shown. In FIG. 1, speech energy 110 is shown with time values displayed on the horizontal axis and with speech energy values displayed on the vertical axis. Speech energy 110 is shown as a data sample which begins at time 116 and which ends at time 118. Furthermore, the particular spoken utterance represented in FIG. 1 includes a beginning point t_s which is shown at time 112 and also includes an ending point t_e which is shown at time 114.
In many speech detection systems, the system user must identify a spoken utterance by manually indicating the beginning and ending points with a user input device, such as a push button or a momentary switch. This “push-to-talk” system presents serious disadvantages in applications where the system user is otherwise occupied, such as while operating an automobile in congested traffic conditions. A system that automatically identifies the beginning and ending points of a spoken utterance thus provides a more effective and efficient method of implementing speech recognition in many user applications.
Some speech-recognition systems determine the beginning and ending points of a spoken utterance by using non-real time analysis techniques. For example, a speech-recognition system may first capture all the speech energy 110 corresponding to a particular utterance starting at time 116 and ending at time 118. Then, the non-real time system may subsequently process the captured speech energy 110 to determine beginning point t_s at time 112 and ending point t_e at time 114. The non-real time system thus delays the calculation of the beginning and ending points until the entire utterance is captured and processed. In contrast, a system which continually recalculates and updates beginning and ending points in real-time as speech energy 110 is being acquired may provide a more responsive and flexible method for implementing a speech recognition system.
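The contrast between the two approaches can be illustrated with a minimal sketch. It is not the method of the invention: it assumes a single fixed energy threshold, and the function names and sample energies are hypothetical. The first function needs the whole captured utterance before it can answer; the second yields an updated estimate after every frame.

```python
def offline_endpoints(energy, threshold):
    """Non-real-time: inspect the fully captured energy sequence once."""
    above = [i for i, e in enumerate(energy) if e > threshold]
    if not above:
        return None
    return above[0], above[-1]


def streaming_endpoints(energy_frames, threshold):
    """Real-time: yield the current (start, end) estimate after each frame."""
    start = end = None
    for i, e in enumerate(energy_frames):
        if e > threshold:
            if start is None:
                start = i  # first frame seen above the threshold
            end = i        # most recent frame seen above the threshold
        yield start, end


# Hypothetical per-frame energies for a short utterance.
energy = [0.1, 0.2, 0.1, 2.5, 3.0, 2.8, 0.3, 0.1]
batch_result = offline_endpoints(energy, threshold=1.0)
running_estimates = list(streaming_endpoints(energy, threshold=1.0))
```

Both variants agree once all frames have been seen, but only the streaming one has an answer available while the speech energy is still being acquired.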
Speech recognition systems use many different speech parameters, including amplitude, short-term auto-correlation coefficients, zero-crossing rates, linear prediction error and harmonic analysis. In spite of attempts to select speech parameters that effectively and accurately allow the detection of human speech, robust speech detection under conditions of significant background noise remains a challenging problem. A system that selects and utilizes effective speech parameters to perform robust speech detection in conditions with background noise may thus provide a more useful and powerful method of speech recognition. Therefore, for all the foregoing reasons, an improved method is needed for implementing a speech recognition system for use during conditions with background noise.
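Two of the simpler parameters of this kind, short-term energy (an amplitude measure) and zero-crossing rate, can be computed per frame as sketched below. The function names and the sample frames are illustrative assumptions and are not taken from the patent.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    if not frame:
        raise ValueError("frame must contain at least one sample")
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        raise ValueError("frame must contain at least two samples")
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# Hypothetical frames: a rapidly alternating signal and a constant one.
alternating = [1.0, -1.0, 1.0, -1.0]
constant = [0.5, 0.5, 0.5, 0.5]
alternating_energy = frame_energy(alternating)
alternating_zcr = zero_crossing_rate(alternating)
constant_zcr = zero_crossing_rate(constant)
```

Broadband background noise tends to raise both measures at once, which is one reason simple parameters like these struggle in the noisy conditions described above.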
SUMMARY OF THE INVENTION
In accordance with the present invention, a method is disclosed for implementing a speech recognition system for use during conditions with background noise. The invention includes a feature extractor within the speech recognition system that receives digital speech data corresponding to a spoken utterance. Within the feature extractor, a filter bank receives the speech data and responsively generates channel energy which is provided to an endpoint detector. The channel energy from the filter bank in the feature extractor is also provided to a feature vector calculator which generates feature vectors that are then provided to a recognizer.
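The data flow inside the feature extractor can be sketched as follows. This is only an illustration under stated assumptions: the filter bank is stood in for by summing naive DFT power over equal-width bands, the feature vector is assumed to be the log of the channel energies, and all function names are hypothetical. The patent text here does not specify any of these details.

```python
import cmath
import math


def filter_bank(frame, num_channels=4):
    """Stand-in filter bank: DFT power summed over equal-width bands."""
    n = len(frame)
    half = n // 2
    if half < num_channels:
        raise ValueError("frame too short for the requested channel count")
    power = []
    for k in range(half):
        s = sum(
            frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        power.append(abs(s) ** 2)
    width = half // num_channels
    return [sum(power[c * width:(c + 1) * width]) for c in range(num_channels)]


def feature_vector(channel_energy):
    """Assumed feature: log channel energy, floored to avoid log(0)."""
    return [math.log(e + 1e-10) for e in channel_energy]


def process_frame(frame):
    """Channel energy feeds both the endpoint detector and the recognizer path."""
    channel_energy = filter_bank(frame)      # goes to the endpoint detector
    features = feature_vector(channel_energy)  # goes to the recognizer
    return channel_energy, features


# Hypothetical 16-sample frame of constant amplitude.
channel_energy, features = process_frame([1.0] * 16)
```

The point of the sketch is the fan-out: one filter-bank output is shared by the endpoint detector and the feature vector calculator, so the channel energy is computed only once per frame.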
In accordance with the present invention, the endpoint detector analyzes the channel energy received from the feature extractor and responsively determines endpoints (beginning and ending points) for the particular spoken utterance represented by the channel energy. In practice, the endpoint detector performs the fundamental steps of first detecting a reliable island in the speech energy from the spoken utterance, and then refining the boundaries (beginning and ending points) of the spoken utterance. The present invention repeatedly recalculates short-term delta energy parameters (DTF parameters) and threshold values in real time, as speech energy is processed by the endpoint detector. In the preferred embodiment, the starting point of the reliable island (t_sr) is detected when the current DTF(i) parameter is first greater than a threshold T_sr for at least five frames. In the preferred embodiment, the stopping point of the reliable island (t_er) is detected when the current DTF(i) value is less than a threshold T_er for at least 60 frames (600 milliseconds) or less than a threshold T_e for at least 40 frames (400 milliseconds).
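The reliable-island rules above can be sketched as a frame-counting loop. Several things here are assumptions rather than statements from the patent: the DTF values are taken as an already computed list, the thresholds are fixed numbers instead of being recalculated in real time, the frames are required to be consecutive, and each detected point is placed at the first frame of its qualifying run. The frame counts imply 10 milliseconds per frame (60 frames = 600 milliseconds).

```python
def find_reliable_island(dtf, t_sr_thresh, t_er_thresh, t_e_thresh,
                         start_frames=5, er_frames=60, e_frames=40):
    """Return (t_sr, t_er) frame indices; either may be None if not found."""
    start = None
    run = 0
    i = 0
    # Start: DTF above T_sr for at least `start_frames` consecutive frames.
    for i, value in enumerate(dtf):
        run = run + 1 if value > t_sr_thresh else 0
        if run >= start_frames:
            start = i - start_frames + 1
            break
    if start is None:
        return None, None

    # Stop: DTF below T_er for `er_frames` frames, or below T_e for `e_frames`.
    below_er = below_e = 0
    for j in range(i + 1, len(dtf)):
        below_er = below_er + 1 if dtf[j] < t_er_thresh else 0
        below_e = below_e + 1 if dtf[j] < t_e_thresh else 0
        if below_er >= er_frames:
            return start, j - er_frames + 1
        if below_e >= e_frames:
            return start, j - e_frames + 1
    return start, None


# Hypothetical DTF track: 10 quiet frames, 20 loud frames, 70 quiet frames.
dtf = [0.0] * 10 + [10.0] * 20 + [0.0] * 70
island = find_reliable_island(dtf, t_sr_thresh=5.0, t_er_thresh=3.0,
                              t_e_thresh=1.0)
```

With these made-up values the stricter threshold T_e is satisfied first, after 40 quiet frames, so the stop decision arrives 200 milliseconds earlier than it would through the T_er rule alone.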
After the starting point t_sr of the reliable island is detected, a backward-searching (or refinement) procedure is used to find the beginning point t_s of the spoken utterance. In the preferred embodiment, the searching range for this refinement procedure is limited to thirty-five frames (350 milliseconds) from the starting point t_sr of the reliable island. The beginning point t_s of the utterance is preferably found when the current DTF(i) parameter is less than a beginning threshold T_s for at least seven frames. Similarly, the ending point t_e of the spoken utterance may preferably be found when the current DTF(i) parameter is less than an ending threshold T_e for a predetermined number of frames.
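A minimal sketch of the backward search for the beginning point follows. It rests on assumptions not stated in the text: the seven frames are taken to be consecutive, t_s is placed at the frame immediately after the quiet run, the search falls back to the edge of the 35-frame window when no quiet run is found, and the function name and sample values are hypothetical.

```python
def refine_beginning(dtf, t_sr, t_s_thresh, max_search=35, min_quiet=7):
    """Search backward from t_sr for the utterance beginning point t_s."""
    lowest = max(0, t_sr - max_search)  # limit of the search window
    quiet = 0
    for i in range(t_sr - 1, lowest - 1, -1):
        quiet = quiet + 1 if dtf[i] < t_s_thresh else 0
        if quiet >= min_quiet:
            # Frames i .. i + min_quiet - 1 are quiet; speech starts after them.
            return i + min_quiet
    return lowest  # assumed fallback: no quiet run inside the window


# Hypothetical DTF track: silence, a soft onset, then the reliable island.
dtf = [0.0] * 20 + [3.0] * 5 + [10.0] * 20
t_s = refine_beginning(dtf, t_sr=25, t_s_thresh=1.0)
```

In this example the soft onset (frames 20 to 24) stays below the island threshold but above T_s, so the refinement moves the beginning point from frame 25 back to frame 20 and keeps the onset inside the utterance.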
The endpoint detector provides the identified endpoints (beginning and ending points of the spoken utterance) to the recognizer and may also, under certain error conditions, provide a restart signal to the recognizer. The recognizer responsively utilizes the feature vectors and the endpoints to perform a speech recognition procedure and advantageously generate a speech recognition result, in accordance with the present invention. The present invention thus efficiently and effectively implements a speech recognition system for use during conditions with background noise.
REFERENCES:
patent: Re. 32172 (1986-06-01), Johnston et al.
patent: 4696041 (1987-09-01), Sakata
patent: 4821325 (1989-04-01), Martin et al.
patent: 5305422 (1994-04-01), Junqua
Parsons. Voice and Speech Processing. McGraw-Hill, Inc. New York. pp. 295-297., 1987.*
Deller et al. Discrete Time Processing of Speech Signals. Macmillan Publishing Company. New York. pp. 224-251., 1993.*
Rabiner et al. Fundamentals of Speech Recognition. Prentice Hall. New Jersey. pp. 143-149., 1993.*
Rangoussi et al. On the Use of Higher Order Statistics for Robust Endpoint Detection of Speech. IEEE Signal Processing Workshop on Higher Order Statistics. pp. 56-60, 1993.*
Jean-Claude Junqua, Brian Mak, and Ben
Chen Ruxin
Olorenshaw Lex
Tanaka Miyuki
Wu Duanpei
Koerner Gregory J.
Simon & Koerner LLP
Sofocleous M. David
Sony Corporation
Todd Voeltz Emanuel