System for using silence in speech recognition

Data processing: speech signal processing – linguistics – language – recognition



Details

US classification: C704S251000, C704S252000, C704S254000
Type: Reexamination Certificate
Status: active
Patent number: 06374219

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates to computer speech recognition. More particularly, the present invention relates to computer speech recognition performed by conducting a prefix tree search of a silence bracketed lexicon.
The most successful current speech recognition systems employ probabilistic models known as hidden Markov models (HMMs). A hidden Markov model includes a plurality of states, wherein a transition probability is defined for each transition from each state to every state, including transitions to the same state. An observation is probabilistically associated with each unique state. The transition probabilities between states (the probabilities that the model will transition from one state to the next) are not all the same. Therefore, a search technique, such as a Viterbi algorithm, is employed in order to determine a most likely state sequence for which the overall probability is maximum, given the transition probabilities between states and the observation probabilities.
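By way of illustration only, and not as part of the patent, an HMM of the kind described above can be sketched in Python as a set of states with a transition-probability matrix and per-state observation probabilities; all numbers and names below are hypothetical:

import numpy as np

# Minimal discrete-observation HMM sketch (illustrative values only).
# trans[i, j]   : probability of moving from state i to state j
#                 (rows sum to 1; self-transitions trans[i, i] are allowed)
# obs_prob[i, k]: probability of emitting observation symbol k while in state i
# initial[i]    : probability of starting in state i
trans = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.2, 0.7]])
obs_prob = np.array([[0.5, 0.2, 0.2, 0.1],
                     [0.1, 0.4, 0.4, 0.1],
                     [0.2, 0.1, 0.2, 0.5]])
initial = np.array([0.8, 0.1, 0.1])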
A sequence of state transitions can be represented, in a known manner, as a path through a trellis diagram that represents all of the states of the HMM over a sequence of observation times. Therefore, given an observation sequence, a most likely path through the trellis diagram (i.e., the most likely sequence of states represented by an HMM) can be determined using a Viterbi algorithm.
In current speech recognition systems, speech has been viewed as being generated by a hidden Markov process. Consequently, HMMs have been employed to model observed sequences of speech spectra, where specific spectra are probabilistically associated with a state in an HMM. In other words, for a given observed sequence of speech spectra, there is a most likely sequence of states in a corresponding HMM.
This corresponding HMM is thus associated with the observed sequence. This technique can be extended so that, if each distinct sequence of states in the HMM is associated with a sub-word unit, such as a phoneme, then a most likely sequence of sub-word units can be found. Moreover, by using models of how sub-word units are combined to form words, and then language models of how words are combined to form sentences, complete speech recognition can be achieved.
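As a hedged sketch of this layered composition (again, not taken from the patent), a pronunciation lexicon maps words to phoneme sequences and a simple bigram language model scores word sequences; every entry below is made up for illustration:

import math

# Hypothetical pronunciation lexicon: word -> phoneme sequence.
lexicon = {
    "cat": ["k", "ae", "t"],
    "cap": ["k", "ae", "p"],
    "dog": ["d", "ao", "g"],
}

# Hypothetical bigram language model: P(word | previous word).
bigram = {
    ("<s>", "cat"): 0.4,
    ("<s>", "dog"): 0.6,
    ("cat", "dog"): 0.3,
}

def sentence_log_score(words, acoustic_log_scores):
    """Combine per-word acoustic scores with bigram language-model scores."""
    total, prev = 0.0, "<s>"
    for word, acoustic in zip(words, acoustic_log_scores):
        total += acoustic + math.log(bigram.get((prev, word), 1e-6))
        prev = word
    return total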
When actually processing an acoustic signal, the signal is typically sampled in sequential time intervals called frames. The frames typically include a plurality of samples and may overlap or be contiguous. Each frame is associated with a unique portion of the speech signal. The portion of the speech signal represented by each frame is analyzed to provide a corresponding acoustic vector. During speech recognition, a search is performed for the state sequence most likely to be associated with the sequence of acoustic vectors.
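The following is a rough, hypothetical sketch of this framing and feature-extraction step (the 25 ms window and 10 ms hop are common choices, not values taken from the patent):

import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D speech signal into overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000.0)
    hop_len = int(sample_rate * hop_ms / 1000.0)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

def acoustic_vectors(frames):
    """Very rough stand-in for a feature extractor: log magnitude spectrum."""
    windowed = frames * np.hamming(frames.shape[1])
    return np.log(np.abs(np.fft.rfft(windowed, axis=1)) + 1e-10)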
In order to find the most likely sequence of states corresponding to a sequence of acoustic vectors, the Viterbi algorithm is employed. The Viterbi algorithm performs a computation which starts at the first frame and proceeds one frame at a time, in a time-synchronous manner. A probability score is computed for each state in the state sequences (i.e., the HMMs) being considered. Therefore, a cumulative probability score is successively computed for each of the possible state sequences as the Viterbi algorithm analyzes the acoustic signal frame by frame. By the end of an utterance, the state sequence (or HMM or series of HMMs) having the highest probability score computed by the Viterbi algorithm provides the most likely state sequence for the entire utterance. The most likely state sequence is then converted into a corresponding spoken subword unit, word, or word sequence.
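A minimal, frame-synchronous Viterbi sketch over the toy HMM defined earlier (illustrative only; it assumes a discrete observation sequence rather than real acoustic vectors):

import numpy as np

def viterbi(obs_seq, initial, trans, obs_prob):
    """Return the most likely state sequence for obs_seq (log domain)."""
    T, n_states = len(obs_seq), len(initial)
    log_delta = np.log(initial) + np.log(obs_prob[:, obs_seq[0]])
    backptr = np.zeros((T, n_states), dtype=int)

    for t in range(1, T):                        # one frame at a time
        scores = log_delta[:, None] + np.log(trans)
        backptr[t] = scores.argmax(axis=0)       # best predecessor per state
        log_delta = scores.max(axis=0) + np.log(obs_prob[:, obs_seq[t]])

    # Trace back from the highest-scoring final state.
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Example: decode a short sequence of observation symbols.
# print(viterbi([0, 1, 3], initial, trans, obs_prob))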
The Viterbi algorithm reduces an exponential computation to one that is proportional to the number of states and transitions in the model and the length of the utterance. However, for a large vocabulary, the number of states and transitions becomes large, and the computation required to update the probability score at each state in each frame for all possible state sequences takes many times longer than the duration of one frame, which is typically about 10 milliseconds.
Thus, a technique called pruning, or beam searching, has been developed to greatly reduce the computation needed to determine the most likely state sequence. This type of technique eliminates the need to compute the probability score for state sequences that are very unlikely. This is typically accomplished by comparing, at each frame, the probability score for each remaining state sequence (or potential sequence) under consideration with the largest score associated with that frame. If the probability score of a state for a particular potential sequence is sufficiently low (when compared to the maximum computed probability score for the other potential sequences at that point in time), the pruning algorithm assumes that such a low-scoring state sequence is unlikely to be part of the completed, most likely state sequence. The comparison is typically accomplished using a minimum threshold value. Potential state sequences having a score that falls below the minimum threshold value are removed from the searching process. The threshold value can be set at any desired level, based primarily on the desired memory and computational savings, and on the acceptable increase in error rate caused by those savings.
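A hedged sketch of this pruning step (the beam width and data layout are assumptions, not the patent's design):

def prune(hypotheses, beam_width):
    """Drop hypotheses whose log score falls too far below the current best.

    hypotheses: dict mapping a hypothesis key (e.g., a partial state sequence)
                to its cumulative log-probability score at the current frame.
    beam_width: how far below the best score a hypothesis may fall before it
                is removed; larger values keep more hypotheses.
    """
    best = max(hypotheses.values())
    threshold = best - beam_width
    return {h: s for h, s in hypotheses.items() if s >= threshold}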
Another conventional technique for further reducing the magnitude of computation required for speech recognition includes the use of a prefix tree. A prefix tree represents the lexicon of the speech recognition system as a tree structure wherein all of the words likely to be encountered by the system are represented in the tree structure.
In such a prefix tree, each subword unit (such as a phoneme) is typically represented by a branch which is associated with a particular phonetic model (such as an HMM). The phoneme branches are connected, at nodes, to subsequent phoneme branches. All words in the lexicon which share the same first phoneme share the same first branch. All words which have the same first and second phonemes share the same first and second branches. By contrast, words which have a common first phoneme, but which have different second phonemes, share the same first branch in the prefix tree but have second branches which diverge at the first node in the prefix tree, and so on. The tree structure continues in such a fashion such that all words likely to be encountered by the system are represented by the end nodes of the tree (i.e., the leaves on the tree).
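One plausible way to realize such a prefix tree (a sketch under assumed names, not the patent's data structure) is a trie keyed by phonemes:

class PrefixTreeNode:
    """One node of a phonetic prefix tree; each outgoing branch is a phoneme."""
    def __init__(self):
        self.children = {}   # phoneme -> PrefixTreeNode
        self.word = None     # set when a complete word ends at this node

def build_prefix_tree(lexicon):
    """Build a prefix tree from a {word: [phonemes]} pronunciation lexicon."""
    root = PrefixTreeNode()
    for word, phonemes in lexicon.items():
        node = root
        for phoneme in phonemes:
            node = node.children.setdefault(phoneme, PrefixTreeNode())
        node.word = word
    return root

# "cat" and "cap" share the branches for "k" and "ae", then diverge at "t"/"p".
tree = build_prefix_tree({"cat": ["k", "ae", "t"], "cap": ["k", "ae", "p"]})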
It is apparent that, by employing a prefix tree structure, the number of initial branches will be far fewer than the typical number of words in the lexicon or vocabulary of the system. In fact, the number of initial branches cannot exceed the total number of phonemes (approximately 40-50), regardless of the size of the vocabulary or lexicon being searched. If allophonic variations are used, however, the initial number of branches could be larger, depending on the allophones used.
This type of structure lends itself to a number of significant advantages. For example, given the small number of initial branches in the tree, it is possible to consider the beginning of all words in the lexicon, even if the vocabulary is very large, by evaluating the probability of each of the possible first phonemes. Further, using pruning, a number of the lower probability phoneme branches can be eliminated very early in the search. Therefore, while the second level of the tree has many more branches than the first level, the number of branches which are actually being considered (i.e., the number of hypotheses) is greatly reduced relative to the number of possible branches.
Speech recognition systems employing the above techniques can typically be classified into two types. The first type is a continuous speech recognition (CSR) system, which is capable of recognizing fluent speech. The second type of system is an isolated speech recognition (ISR) system, which is typically employed to recognize only isolated speech.
