Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition
Reexamination Certificate
1998-06-24
2001-03-27
Šmits, Tālivaldis I. (Department: 2741)
C704S270000
active
06208963
BACKGROUND OF THE INVENTION
The present invention relates generally to signal processing systems and methods, and more particularly to signal classification systems.
Automatic signal recognition, such as automatic speech recognition (ASR), by computer is a particularly difficult task. Despite an intensive world-wide research effort for over forty years, existing ASR technology still has many limitations. Moderate success has been achieved for controlled environment, small vocabulary, limited scope applications. Moving beyond these limited applications is difficult because of the complexity of the ASR process.
In the ASR process for a large vocabulary system, the speech input begins as a thought in the speaker's mind and is converted into an acoustic wave by his vocal apparatus. This acoustic wave enters the ASR machine through a transducer/converter which changes the acoustic wave from pressure variations into a representative stream of numbers for subsequent computer processing. This number stream is grouped into successive time intervals or segments (typically 10-20 milliseconds). A feature extraction procedure is applied to each interval. The features are a set of parameters that describe the characteristics of the interval. Their exact definition depends upon the particular ASR method. The features can be used to classify the groups into subword units, usually phonemes. A classification procedure is applied to the resulting sequence to produce words for the text output. This is the general ASR procedure; specific systems vary in the features and classification methods used.
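The front-end pipeline described above (sample stream, fixed-length intervals, a feature vector per interval) can be sketched as follows. This is a minimal illustration, not the patent's method: the 20 ms frame length matches the range quoted in the text, but the toy features (log energy plus a few spectral magnitudes) and all function names are assumptions for demonstration.

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=20):
    """Group the number stream into successive fixed-length intervals."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)

def extract_features(frame):
    """Toy feature vector: log energy plus low-order spectral magnitudes."""
    spectrum = np.abs(np.fft.rfft(frame))
    log_energy = np.log(np.sum(frame ** 2) + 1e-10)
    return np.concatenate(([log_energy], spectrum[:8]))

# One second of a synthetic 440 Hz tone standing in for a speech signal.
t = np.linspace(0, 1, 16000, endpoint=False)
signal = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(signal)                        # 50 frames of 20 ms
features = np.array([extract_features(f) for f in frames])
print(frames.shape, features.shape)                  # (50, 320) (50, 9)
```

A real system would replace the toy features with something like mel-cepstral coefficients, but the structure of the pipeline is the same.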
The variation in speakers' acoustic production compounds the classification complexity. Different speakers pronounce sounds differently and at different voice pitches. Even the same sound spoken by the same speaker will vary from instance to instance. In addition, a transducer (such as a microphone) captures and adds to the signal other sources besides the speaker, such as room noise, room echo, equipment noise, other speakers, etc. Grouping the data into time intervals for feature analysis assumes that the signal is stationary throughout the interval, with changes occurring only at the boundaries. This is not strictly true; in fact, the validity of the assumption varies with the type of speech sound, and the assumption itself introduces variation into the feature extraction process. Since speech is a continuous process, breaking the sounds into a finite number of subword units also contributes phonological variation. There is no simple, direct, consistent relationship between the spoken word input and the analysis entities used to identify it.
Generally, there have been three approaches to ASR: acoustic-phonetic, pattern recognition, and artificial intelligence (Fundamentals of Speech Recognition, L. Rabiner and B. H. Juang, Prentice-Hall, Inc., 1993, p. 42). The acoustic-phonetic approach attempts to identify and use features that directly identify phonemes. The features are used to segment and label the speech signal and directly produce a phoneme stream. This approach assumes that a feature set exists such that definitive rules can be developed and applied to accurately identify the phonemes in the speech signal, and therefore to determine the words with a high degree of certainty. Variance in the speech signal fatally weakens this assumption.
The pattern matching approach has been most successful to date. The features are usually based upon a spectral analysis of speech wave segments. Reference patterns are created for each of the recognition units, usually several for each unit to cover variation. The reference patterns are either templates or some type of statistical model such as a Hidden Markov Model (HMM). An unknown speech segment can be classified by its “closest” reference pattern. Specific implementations differ in use of models versus templates, type of recognition unit, reference pattern creation methods, and classification (or pattern recognition) methods.
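The template variant of pattern matching described above can be sketched as a nearest-reference classifier. This is a hedged illustration: the reference vectors, the `classify` helper, and the use of Euclidean distance (standing in for a spectral distance measure) are assumptions, not the document's implementation.

```python
import numpy as np

def classify(feature_vec, templates):
    """Assign the unknown segment to its 'closest' reference pattern."""
    best_label, best_dist = None, float("inf")
    for label, refs in templates.items():
        for ref in refs:                      # several templates per unit
            d = np.linalg.norm(feature_vec - ref)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label

# Hypothetical reference patterns: two templates per phoneme to cover variation.
templates = {
    "aa": [np.array([1.0, 0.2]), np.array([0.9, 0.3])],
    "iy": [np.array([0.1, 1.0]), np.array([0.2, 0.9])],
}
print(classify(np.array([0.85, 0.25]), templates))    # aa
```

Statistical-model variants replace the distance computation with a likelihood under an HMM or similar model, but the "closest reference wins" structure is the same.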
Pattern matching ASR systems integrate knowledge from several sources prior to making the final output decision. Many systems typically use a language model. A language model improves recognition by providing additional constraints at the word level; word pair probabilities (bigrams), word triplet probabilities (trigrams), allowable phrases, most likely responses, etc. depending on the application. Knowledge sources can be integrated either bottom up or top down. In the bottom up approach, lower level processes precede higher level processes with the language model applied at the final step. In the top down method, the model generates word hypotheses and matches them against the input speech signal.
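The word-pair (bigram) constraint mentioned above can be illustrated with a toy language model that scores competing word hypotheses. The counts, smoothing scheme (add-one), and vocabulary here are invented for demonstration only.

```python
import math

# Hypothetical bigram and unigram counts from a small training corpus.
bigram_counts = {("recognize", "speech"): 8, ("wreck", "a"): 3,
                 ("a", "nice"): 3, ("nice", "beach"): 2}
unigram_counts = {"recognize": 9, "wreck": 3, "a": 4, "nice": 3}

def bigram_log_prob(words):
    """Score a word hypothesis with add-one-smoothed bigram probabilities."""
    vocab = len(unigram_counts) + 1
    score = 0.0
    for w1, w2 in zip(words, words[1:]):
        num = bigram_counts.get((w1, w2), 0) + 1
        den = unigram_counts.get(w1, 0) + vocab
        score += math.log(num / den)
    return score

# The language model prefers the more probable word sequence.
a = bigram_log_prob(["recognize", "speech"])
b = bigram_log_prob(["wreck", "a", "nice", "beach"])
print(a > b)    # True
```

In a full system this score would be combined with the acoustic match score to rank candidate phrases, either bottom up or top down as described above.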
The best performing large vocabulary systems to date are top down pattern matchers that use HMMs with Gaussian mixture output distributions to model phonemes. Processing begins when an entire phrase is input. A language model is used to generate candidate phrases. The canonical phonetic pronunciation of each candidate phrase is modeled by connected HMM phonetic models that produce a sequence of feature probability distributions. These distributions are compared to the features of the input speech phrase and the most likely candidate phrase is selected for output. High performance on large vocabularies requires large amounts of computational capacity in both memory and time; real time speech recognition is not currently possible on a desktop system without significant performance compromises. Other drawbacks include sensitivity to the amount of training data, sensitivity of reference patterns to speaking environment and transmission channel characteristics, and non-use of specific speech knowledge.
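The Gaussian mixture output distributions mentioned above can be sketched for a single scalar feature. The mixture weights, means, and variances below are illustrative stand-ins, not trained values, and a real system would evaluate multivariate mixtures across a full HMM state sequence.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_likelihood(x, mixture):
    """Likelihood of a feature under a weighted Gaussian mixture."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in mixture)

# Each candidate phoneme state is modeled by a mixture output distribution.
states = {
    "aa": [(0.6, 1.0, 0.1), (0.4, 1.2, 0.2)],
    "iy": [(0.5, -0.5, 0.1), (0.5, -0.8, 0.3)],
}
frame_feature = 0.95
best = max(states, key=lambda s: mixture_likelihood(frame_feature, states[s]))
print(best)    # aa
```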
Artificial intelligence is a collection of implementation techniques rather than a separate ASR approach. These techniques are generally of two types: expert systems and neural networks. Expert systems provide a systematic method to integrate various knowledge sources through the development and application of rules, and are best suited to the acoustic-phonetic approach. Neural networks were originally developed to model interactions within the brain. They come in many varieties, but all are pattern recognizers that require training to determine network parameter values. They can model non-linear relationships and can generalize, that is, classify patterns not in the training data. Neural networks have been successfully used in ASR to classify both phonemes and words.
There is, therefore, a need for a signal processing and classification system that achieves increased performance in time, accuracy, and overall effectiveness. Moreover, there is a need for a signal processing and classification system that provides highly accurate, real-time, speaker independent voice recognition on a desktop computer.
SUMMARY OF THE INVENTION
Methods and apparatus consistent with this invention for signal classification using a network include several steps performed using a multilayer network. The steps include: receiving an input signal feature vector in a first layer; applying a relaxation process that updates an activation value of nodes in the multilayer network for a current time step; outputting a classification.
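The relaxation step above (updating node activations over successive time steps until the network settles) can be sketched in a generic form. This is a minimal sketch under stated assumptions: the sigmoid squashing function, the update rate, and the mutual-inhibition weights are illustrative choices, not the claimed method.

```python
import numpy as np

def relax_step(activations, weights, inputs, rate=0.2):
    """One relaxation time step: each node's activation moves toward the
    combination of its external input and the other nodes' influence."""
    net = inputs + weights @ activations
    target = 1.0 / (1.0 + np.exp(-net))      # squash to (0, 1)
    return activations + rate * (target - activations)

# Two competing nodes with mutual inhibition; the stronger input wins.
weights = np.array([[0.0, -1.5], [-1.5, 0.0]])   # illustrative values
inputs = np.array([2.0, 0.5])                     # external evidence
act = np.array([0.5, 0.5])
for _ in range(50):                               # iterate until settled
    act = relax_step(act, weights, inputs)
print(act[0] > act[1])    # True
```

The classification output then corresponds to reading off the node with the highest settled activation value.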
A multilayer network for signal classification consistent with the present invention includes: a first layer for classifying a first signal feature, wherein the first layer includes structure for receiving an input signal feature vector; a second layer for classifying a second signal feature representing a context of the first signal feature; structure for interaction between the first and second layers; structure for applying a relaxation process that updates an activation value for a node in each of the first and second layers; and structure for generating a signal classification from the first and second classified features according to an activation value of a node in the multilayer network.
Both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.
Hansen Carl Hal
Martinez Tony R.
Moncur R. Brian
Parr Randall J.
Shepherd D. Lynn
Finnegan Henderson Farabow Garrett & Dunner L.L.P.
Šmits, Tālivaldis I.