Reexamination Certificate
2000-08-28
2002-07-09
Šmits, Tālivaldis Ivars (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S243000
active
06418412
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to speech recognition and more particularly relates to determining and providing frequency and mean compensated frequency input data to respective quantizer(s) and backend processors to provide efficient and robust speech recognition.
2. Description of the Related Art
Speech is perhaps the most important communication method available to mankind. It is also a natural method for man-machine communication. Man-machine communication by voice offers a whole new range of information/communication services which can extend man's capabilities, serve his social needs, and increase his productivity. Speech recognition is a key element in establishing man-machine communication by voice, and, as such, speech recognition is an important technology with tremendous potential for widespread use in the future.
Voice communication between man and machine benefits from an efficient speech recognition interface. Speech recognition interfaces are commonly implemented as Speaker-Dependent (SD)/Speaker-Independent (SI) Isolated Word Speech Recognition (IWSR)/continuous speech recognition (CSR) systems. The SD/SI IWSR/CSR system provides, for example, a beneficial voice command interface for hands free telephone dialing and interaction with voice store and forwarding systems. Such technology is particularly useful in an automotive environment for safety purposes.
However, to be useful, speech recognition must generally be very accurate in correctly recognizing (classifying) an input signal with a satisfactory probability of accuracy. Difficulty in correct recognition arises particularly when operating in an acoustically noisy environment. Recognition accuracy may be severely, unfavorably impacted under realistic environmental conditions where speech is corrupted by various levels of acoustic noise.
FIG. 1 generally characterizes a speech recognition process by the speech recognition system 100. A microphone transducer 102 picks up an input signal 101 and provides to signal preprocessor 104 an electronic signal representation of input signal 101. The input signal 101 is an acoustic waveform of a spoken input, typically a word, or a connecting string of words. The signal preprocessor 104 may, for example, filter the input signal 101, and a feature extractor 106 extracts selected information from the input signal 101 to characterize the signal using, for example, cepstral frequencies or line spectral pair frequencies (LSPs).
Referring to FIG. 2, feature extraction in operation 106 is basically a data-reduction technique whereby a large number of data points (in this case samples of the input signal 101 recorded at an appropriate sampling rate) are transformed into a smaller set of features which are “equivalent”, in the sense that they faithfully describe the salient properties of the input signal 101. Feature extraction is generally based on a speech production model which typically assumes that the vocal tract of a speaker can be represented as the concatenation of lossless acoustic tubes (not shown) which, when excited by excitation signals, produce a speech signal. Samples of the speech waveform are assumed to be the output of a time-varying filter that approximates the transmission properties of the vocal tract. It is reasonable to assume that the filter has fixed characteristics over a time interval on the order of 10 to 30 milliseconds. Thus, the short-time samples of input signal 101 may be represented by a linear, time-invariant all-pole filter designed to model the spectral envelope of the input signal 101 in each time frame. The filter may be characterized within a given interval by an impulse response and a set of coefficients.
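The short-time assumption described above can be illustrated with a minimal Python sketch that cuts a sampled signal into fixed-length frames. The function name `split_into_frames`, the 20 msec frame length, and the 10 msec hop are illustrative assumptions, not values taken from the patent; the text only states that the filter is treated as fixed over roughly 10 to 30 milliseconds.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20.0, hop_ms=10.0):
    """Cut a list of samples into overlapping short-time frames.

    frame_ms and hop_ms are assumed values chosen for illustration only.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop_len = int(round(sample_rate_hz * hop_ms / 1000.0))
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop lengths must be at least one sample")

    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# Usage: 100 msec of signal at 8 kHz, cut into 20 msec frames every 10 msec.
signal = [0.0] * 800
frames = split_into_frames(signal, 8000)
```

Each resulting frame would then be analyzed on its own, as if the vocal tract filter were time-invariant within it.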
Feature extraction in operation 106 using linear predictive (LP) speech production models has become the predominant technique for estimating basic speech parameters such as pitch, formants, spectra, and vocal tract area functions. The LP model allows for linear predictive analysis which basically approximates input signal 101 as a linear combination of past speech samples. By minimizing the sum of the squared differences (over a finite interval) between actual speech samples and the linearly predicted ones, a unique set of prediction filter coefficients can be determined. The predictor coefficients are weighting coefficients used in the linear combination of past speech samples. The LP coefficients are generally updated very slowly with time, for example, every 10-30 milliseconds, to represent the changing states of the vocal tract. LP prediction coefficients are calculated using a variety of well-known procedures, such as autocorrelation and covariance procedures, to minimize the difference between the actual input signal 101 and a predicted input signal 101. The LP prediction coefficients are often stored as a spectral envelope reference pattern and can be easily transformed into several different representations including cepstral coefficients and line spectrum pair (LSP) frequencies. Details of LSP theory can be found in N. Sugamura, “Speech Analysis and Synthesis Methods Developed at ECL in NTT - from LPC to LSP”, Speech Communication 5, Elsevier Science Publishers, B. V., pp. 199-215 (1986).
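As a rough illustration of the autocorrelation procedure mentioned above, the following Python sketch computes predictor coefficients for one frame with the Levinson-Durbin recursion. The text names the autocorrelation method but does not specify how the resulting equations are solved, so the choice of Levinson-Durbin, the function names, and the omission of windowing are assumptions made for this example.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations for the given autocorrelation values.

    Returns (coeffs, error) where coeffs[k-1] weights the sample k steps
    in the past, i.e. predicted s[n] = sum(coeffs[k-1] * s[n-k]), and
    error is the remaining squared prediction error.
    """
    if r[0] <= 0.0:
        raise ValueError("frame has no energy")
    a = [0.0] * (order + 1)          # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error              # reflection coefficient for step i
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error


# Usage: a decaying exponential obeys s[n] = 0.9 * s[n-1], so a
# second-order analysis should find coefficients close to (0.9, 0.0).
frame = [0.9 ** n for n in range(200)]
coeffs, residual = levinson_durbin(autocorrelation(frame, 2), 2)
```

In practice a frame is usually windowed before the autocorrelation is taken, and the order is chosen to suit the sampling rate; both steps are left out here to keep the sketch short.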
Final decision-logic classifier 108 utilizes the extracted feature information to classify the represented input signal 101 against a database of representative input signals 101. Speech recognition classifying problems can be treated as a classical pattern recognition problem. Fundamental ideas from signal processing, information theory, and computer science can be utilized to facilitate isolated word recognition and recognition of simple connected-word sequences.
FIG. 2 illustrates a more specific speech recognition system 200 based on pattern recognition as used in many IWSR type systems. The extracted features representing input signal 101 are segmented into short-term input signal 101 frames and considered to be stationary within each frame for a duration of 10 to 30 msec. The extracted features may be represented by a D-dimensional vector and compared with predetermined, stored reference patterns 208 by the pattern similarity operation 210. Similarity between the input signal 101 pattern and the stored reference patterns 208 is determined in pattern similarity operation 210 using well-known vector quantization processes. The vector quantization process yields spectral distortion or distance measures to quantify the score of fitness or closeness between the representation of input signal 101 and each of the stored reference patterns 208.
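The comparison of a D-dimensional feature vector with stored reference patterns can be sketched in Python as follows. The squared Euclidean distance is used here only as a simple stand-in for the spectral distortion measure, which the text does not specify, and the names `squared_euclidean` and `quantize` are hypothetical.

```python
def squared_euclidean(x, y):
    """Squared Euclidean distance between two equal-length vectors.

    An assumed, generic distance; real systems often use a spectral
    distortion measure suited to the chosen features.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum((a - b) ** 2 for a, b in zip(x, y))


def quantize(feature, codebook):
    """Return (index of closest reference pattern, all distances)."""
    distances = [squared_euclidean(feature, ref) for ref in codebook]
    best = min(range(len(codebook)), key=distances.__getitem__)
    return best, distances


# Usage: three 2-dimensional reference patterns and one input vector.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
best, distances = quantize([0.9, 1.2], codebook)
```

The list of distances, one per reference pattern, is what a subsequent decision step would examine.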
The decision rule operation 212 receives the distance measures and determines which of the reference patterns 208 the input signal 101 most closely represents. In a “hard” decision making process, input signal 101 is matched to one of the reference patterns 208. This one-to-one “hard decision” ignores the relationship of the input signal 101 to all the other reference patterns 208. Fuzzy methods have been introduced to provide a better match between vector quantized frames of input signal 101 and reference patterns 208. In a “soft” or “fuzzy” decision making process, input signal 101 is related to one or more reference patterns 208 by weighting coefficients.
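The difference between the hard and the fuzzy decision can be illustrated with a short Python sketch. The weighting formula below is the membership rule from fuzzy c-means clustering, chosen as one common way to derive such weighting coefficients; the text does not say which formula is used, and the names `hard_decision`, `fuzzy_memberships`, and the `fuzziness` parameter are assumptions.

```python
def hard_decision(distances):
    """Pick the single reference pattern with the smallest distance."""
    return min(range(len(distances)), key=distances.__getitem__)


def fuzzy_memberships(distances, fuzziness=2.0):
    """Turn squared distances into weights that sum to one.

    Uses the fuzzy c-means membership rule (an assumption): weight i is
    proportional to (1 / d_i) ** (1 / (fuzziness - 1)). An exact match
    (distance zero) takes all of the weight.
    """
    if fuzziness <= 1.0:
        raise ValueError("fuzziness must be greater than 1")
    exact = [i for i, d in enumerate(distances) if d == 0.0]
    if exact:
        return [1.0 / len(exact) if i in exact else 0.0
                for i in range(len(distances))]
    exponent = 1.0 / (fuzziness - 1.0)
    inverse = [(1.0 / d) ** exponent for d in distances]
    total = sum(inverse)
    return [value / total for value in inverse]


# Usage: two equally close patterns and one that is twice as far away.
distances = [1.0, 1.0, 2.0]
winner = hard_decision(distances)        # keeps only one pattern
weights = fuzzy_memberships(distances)   # keeps a weight for every pattern
```

In this example the hard decision must discard one of two equally close patterns, while the fuzzy weights record that both are equally plausible.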
Matrix quantization has also been used to introduce temporal information about input signal 101 into decision rule operation 212. Fuzzy analysis methods have also been incorporated into matrix quantization processes, as described in Xydeas and Cong, “Robust Speech Recognition In a Car Environment”, Proceedings of the DSP95 International Conference on Digital Signal Processing, Jun. 26-28, 1995, Limassol, Cyprus. Fuzzy matrix quantization allows for “soft” decisions using interframe information related to the “evolution” of the short-term spectral envelopes of input signal 101.
Input signal corruption by acoustical noise has long been responsible for difficulties in input signal recognition accuracy. However, in many environment
Asghar Safdar M.
Cong Lin
Armstrong Angela
Legerity Inc.
Skjerven Morrill & MacPherson LLP
Terrile Stephen A.
Šmits Tālivaldis Ivars