Reexamination Certificate
1999-04-29
2001-02-06
Hudspeth, David R. (Department: 2741)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S256000
active
06185528
ABSTRACT:
FIELD OF THE INVENTION
The invention relates to automatic speech recognition systems and, in particular, to a method and a device for isolated word recognition in large vocabularies, wherein words are represented through a combination of acoustic-phonetic units of the language and wherein recognition is effected through two sequential steps in which the techniques of neural networks and Markov models are respectively used, and the results of both techniques are adequately combined so as to improve recognition accuracy.
BACKGROUND OF THE INVENTION
Neural networks are a parallel processing structure reproducing the cerebral cortex organization in very simplified form. A neural network is formed by numerous processing units, called neurons, strongly interconnected through links of different intensity, called synapses or interconnection weights. Neurons are generally organized according to a layered structure, comprising an input layer, one or more intermediate layers and an output layer. Starting from the input units, which receive the signal to be processed, processing propagates to the subsequent layers in the network up to the output units that provide the result. Various implementations of neural networks are described, for example, in the book by D. Rumelhart, “Parallel Distributed Processing”, Vol. 1: Foundations, MIT Press, Cambridge, Mass., 1986.
The neural network technique is applicable to many sectors and in particular to speech recognition, where a neural network is used to estimate the probability P(Q|X) of a phonetic unit Q, given the parametric representation X of a portion of the input speech signal. Words to be recognized are represented as a concatenation of phonetic units, and a dynamic programming algorithm is used to identify the word having the highest probability of being the one uttered.
Hidden Markov models are a classical speech recognition technique. A model of this type is formed by a number of states interconnected by the possible transitions. Transitions are associated with a probability of passing from the origin state to the destination state. Further, each state may emit symbols of a finite alphabet, according to a given probability distribution. In the case of speech recognition, each model represents an acoustic-phonetic unit through a left-to-right automaton in which it is possible to remain in each state with a cyclic transition or to pass to the next state. Furthermore, each state is associated with a probability density defined over X, where X represents a vector of parameters derived from the speech signal every 10 ms. Symbols emitted, according to the probability density associated with the state, are therefore the infinite possible parameter vectors X. This probability density is given by a mixture of Gaussian curves in the multidimensional space of the input vectors.
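The model structure described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the diagonal-covariance form of the Gaussians and the treatment of the last state are assumptions made purely for illustration.

```python
import math

def left_to_right_transitions(n_states, p_stay):
    """Transition matrix in which each state either loops or moves to the next."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            A[i][i] = p_stay            # cyclic transition (remain in the state)
            A[i][i + 1] = 1.0 - p_stay  # pass to the next state
        else:
            A[i][i] = 1.0               # assumption: the last state only loops
    return A

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at parameter vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_mixture_density(x, weights, means, variances):
    """Log of a weighted sum of Gaussians, computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

Here `log_mixture_density` plays the role of the probability density attached to one state, evaluated on one parameter vector X.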
Also in case of hidden Markov models, words to be recognized are represented as a concatenation of phonetic units and use is made of a dynamic programming algorithm (Viterbi algorithm) to find out the word uttered with the highest probability, given the input speech signal.
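As a rough illustration of this dynamic programming search, the following Python sketch scores each word as a single left-to-right chain of states and picks the best one. The names `viterbi_left_to_right` and `best_word`, the shared transition probabilities and the requirement that a path start in the first state and end in the last are assumptions of the sketch, not details given in the patent.

```python
import math

NEG_INF = float("-inf")

def viterbi_left_to_right(log_emis, log_stay, log_next):
    """Best-path log score through a left-to-right chain of states.

    log_emis[t][s] is the log emission score of state s at frame t.
    The path is assumed to start in state 0 and end in the last state.
    """
    n_states = len(log_emis[0])
    score = [NEG_INF] * n_states
    score[0] = log_emis[0][0]
    for frame in log_emis[1:]:
        new = [NEG_INF] * n_states
        for s in range(n_states):
            stay = score[s] + log_stay
            move = score[s - 1] + log_next if s > 0 else NEG_INF
            new[s] = max(stay, move) + frame[s]
        score = new
    return score[-1]

def best_word(word_emissions, log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Return the (word, score) pair with the highest best-path score."""
    scored = {
        word: viterbi_left_to_right(emis, log_stay, log_next)
        for word, emis in word_emissions.items()
    }
    return max(scored.items(), key=lambda kv: kv[1])
```

In a real recognizer the entries of `log_emis` would come from the state densities of the concatenated phonetic units; here they are simply passed in as precomputed numbers.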
More details about this recognition technique can be found, e.g., in: L. Rabiner, B.-H. Juang, “Fundamentals of Speech Recognition”, Prentice Hall, Englewood Cliffs, N.J. (USA).
The method of this invention makes use of both the neural network technique and the Markov model technique through a two-step recognition and a combination of the results obtained by means of both techniques.
A recognition system in which the scores of different recognisers are combined to improve recognition accuracy is described in the paper “Speech recognition using segmental neural nets” by S. Austin, G. Zavaliagkos, J. Makhoul and R. Schwartz, presented at the ICASSP 92 Conference, San Francisco, March 23-26, 1992.
This known system performs a first recognition by means of hidden Markov models, providing a list of the N best recognition hypotheses (for instance: 20), i.e. of the N sentences that have the highest probability to be the sentence being actually uttered, along with their likelihood scores. The Markov recognition stage also provides for a phonetic segmentation of each hypothesis and transfers the segmentation result to a second recognition stage, based on a neural network. This stage performs recognition starting from the phonetic segments supplied by the first Markov step and provides in turn a list of hypotheses, each associated with a likelihood score, according to the neural recognition technique. Both scores are then linearly combined so as to form a single list, and the best hypothesis originating from such a combination is chosen as recognised utterance.
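The final linear combination of the known system can be pictured with a short Python sketch. The function name `combine_linear`, the single mixing weight and the convention that higher scores are better are assumptions for illustration; the cited paper is not reproduced here.

```python
def combine_linear(hmm_scores, nn_scores, weight=0.5):
    """Rank hypotheses by weight * hmm + (1 - weight) * nn (higher is better).

    Both arguments map a hypothesis to its likelihood score. Only the
    hypotheses scored by both stages are kept.
    """
    combined = {
        hyp: weight * hmm_scores[hyp] + (1.0 - weight) * nn_scores[hyp]
        for hyp in hmm_scores
        if hyp in nn_scores
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

Because the raw scores of the two stages are added directly, a stage whose scores span a wider range tends to dominate the ranking.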
A system of this kind has some drawbacks. A first drawback is due to the second recognition step being performed starting from phonetic segments supplied by the first step: if segmentation is affected by time errors, the second step will in turn produce recognition errors that propagate to the final list. Furthermore, such a system is inadequate for isolated word recognition within large vocabularies, since it employs as a first stage the Markov recognizer, which under such particular circumstances is slightly less efficient than the neural one in terms of computational burden. Additionally, if one considers that the hypotheses provided by a Markov recognizer and a neural network recognizer show rather different score dynamics, a sheer linear combination of scores may lead to results which are not significant. Finally, the known system does not supply any reliability information about the recognition effected.
Availability of said information in systems exploiting isolated word recognition is on the other hand a particularly important feature: as a matter of fact, these systems generally request the user to confirm the uttered word, thus causing a longer procedure time. If reliability information is provided, the system can request confirmation only when recognition reliability falls below a given threshold, speeding up the procedure with benefits for both the user and the system operator.
OBJECT OF THE INVENTION
The purpose of the invention is to provide a recognition method and device of the above type, which are conveniently designed to recognise isolated words within large vocabularies and which allow improving the recognition accuracy and obtaining a recognition reliability evaluation.
SUMMARY OF THE INVENTION
In particular, the method according to this invention is characterised in that the two recognition steps operate sequentially on a same utterance to be recognized, in such a way that the neural step analyses the entire active vocabulary and the Markov step analyses a partial vocabulary only, represented by the list of hypotheses provided as the neural step result, and in that additionally an evaluation of recognition reliability is made for the best hypothesis of the re-ordered list, based on the scores resulting from the combination and associated with such best hypothesis and with one or more hypotheses lying in subsequent positions in the re-ordered list, thereby producing a reliability index that may have at least two values corresponding to a recognition rated as “certain” or as “not certain”, respectively.
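The overall flow of the two sequential steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the scorer callables `neural_score` and `markov_score`, the zero-mean, unit-variance normalisation, the plain sum of normalised scores and the gap test against a fixed `margin` are all hypothetical choices, not the combination or reliability rule defined by the invention.

```python
import statistics

def normalise(scores):
    """Zero-mean, unit-variance scaling (an assumed normalisation)."""
    values = list(scores.values())
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values) or 1.0  # guard against zero spread
    return {w: (s - mean) / spread for w, s in scores.items()}

def recognise(utterance, vocabulary, neural_score, markov_score,
              n_best=3, margin=0.5):
    # Step 1: the neural scorer analyses the entire active vocabulary.
    neural_all = {w: neural_score(utterance, w) for w in vocabulary}
    shortlist = sorted(neural_all, key=neural_all.get, reverse=True)[:n_best]
    # Step 2: the Markov scorer analyses only the shortlisted words.
    neural = normalise({w: neural_all[w] for w in shortlist})
    markov = normalise({w: markov_score(utterance, w) for w in shortlist})
    # Combine the normalised scores and re-order the list.
    combined = sorted(((w, neural[w] + markov[w]) for w in shortlist),
                      key=lambda kv: kv[1], reverse=True)
    # Reliability (assumed rule): gap between the best score and the runner-up.
    gap = combined[0][1] - combined[1][1] if len(combined) > 1 else float("inf")
    label = "certain" if gap >= margin else "not certain"
    return combined[0][0], label, combined
```

A caller could then ask the user for confirmation only when the returned label is "not certain".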
A recognizer for carrying out the method is characterised in that the neural network recognition unit is located before the recognition unit based on hidden Markov models and is capable of effecting its recognition by operating on the entire active vocabulary, and the recognition unit based on hidden Markov models is capable of effecting its recognition independently of the neural network recognition unit, by acting on a partial vocabulary formed by the hypotheses contained in the list supplied by the neural network unit; and in that the processing unit comprises evaluation means for evaluating recognition reliability for the hypothesis that has the best likelihood score in the re-ordered list of hypotheses, by using the combined scores associated with the hypotheses contained in the re-ordered list, said evaluation means being capable of supplying
Fissore Luciano
Gemello Roberto
Ravera Franco
Azad Abul K.
CSELT - Centro Studi e Laboratori Telecomunicazioni S.P.A.
Dubno Herbert
Hudspeth David R.