Reexamination Certificate
1999-06-10
2002-10-01
Knepper, David D. (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
active
06460017
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention is directed to a method for adapting hidden Markov models to the operating demands of a speech recognition system, particularly using specially formed, multilingual hidden Markov sound models that are adapted to the language of the application.
2. Description of the Prior Art
A speech recognition system essentially accesses two independent sources of knowledge. First, there is a phoneme lexicon with which the vocabulary to be recognized is defined. For example, the ASCII strings of the individual words to be recognized, as well as their phonetic transcriptions, are stored there. This lexicon also prescribes what is referred to as a “task”. Second, there is a code book that contains the parameters of the hidden Markov sound models (HMM) and thus, in particular, the mid-points of the probability density distributions belonging to the recognition segments.
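The following minimal Python sketch illustrates these two knowledge sources. All names, the example words, and the feature dimensionality are illustrative assumptions, not structures prescribed by the patent.

```python
# Minimal sketch (illustrative names) of the two knowledge sources described above:
# a phoneme lexicon and an HMM code book of Gaussian density parameters.
from dataclasses import dataclass, field
from typing import Dict, List

# Phoneme lexicon: maps each word of the task vocabulary (as an ASCII string)
# to its phonetic transcription, i.e. a sequence of phoneme symbols.
lexicon: Dict[str, List[str]] = {
    "MILLER": ["m", "I", "l", "3r"],
    "SMITH":  ["s", "m", "I", "T"],
}

@dataclass
class DensityCodeBook:
    """HMM code book: mean vectors ("mid-points") of the probability
    density distributions belonging to the recognition segments."""
    means: Dict[str, List[List[float]]] = field(default_factory=dict)

    def segment_means(self, segment: str) -> List[List[float]]:
        return self.means[segment]

# One entry per recognition segment (here: per phoneme), each holding the
# mean vectors of its mixture components (toy 3-dimensional features).
code_book = DensityCodeBook(means={
    "m": [[0.1, -0.3, 0.7], [0.2, -0.1, 0.5]],
    "I": [[0.4,  0.2, -0.6]],
})
```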
The best performance of a speech recognition system is observed when the HMM code book is optimally adapted to the lexicon. This is the case when the HMM code book is operated together with the lexicon with which it was originally produced by training. When this cannot be assured, a deterioration in performance is observed.
In speech recognition systems as used, for example, in switching systems, the problem often arises that the initially trained vocabulary with which the system is delivered is modified by the customer during operation. The new words then usually contain co-articulations between phonemes that could not be trained beforehand. There is thus a mismatch between the lexicon and the HMM code book, which leads to degraded recognition performance in practical operation.
A practical example of such a situation would be a company telephone exchange that understands the names of the employees, automatically recognizes a caller's connection request on the basis of his speech input, and forwards the call to the corresponding extension (call-by-name). The names of the employees are thus stored in the lexicon. These names change again and again due to employee turnover, and for the reasons given above the system will therefore exhibit unsatisfactory recognition performance.
In order to assure an optimally high recognition performance of a speech recognition system under the described conditions of use, it is thus necessary to implement an adaptation of the underlying HMM code book of this recognition system to the newly established task. Different methods for solving this problem are known from the prior art. Hon H. W., Lee K. F., “On Vocabulary-Independent Speech Modeling”, Proc. IEEE Intern. Conf. on Acoustics, Speech, and Signal Processing, Albuquerque, N. Mex., 1990, discloses a solution in which it is proposed to retrain the code book in order to adapt it to the lexicon. This procedure has the disadvantage that the vocabulary of the ultimate application is generally only partly known at the time of training. If the retraining must then be started at a later point in time, all potentially required acoustic models for a new vocabulary must be kept on hand, which is uneconomical and difficult to implement in practice.
What is referred to as a MAP algorithm (maximum a posteriori) for the adaptation of the acoustic models by the user on the basis of a specific set of speech samples is disclosed by Lee C. H., Gauvain J. L., “Speaker Adaption Based on MAP Estimation of HMM Parameters”, Proc. IEEE Intern. Conf. on Acoustics, Speech and Signal Processing, Minneapolis, Minn., 1993. The purchaser of the speech recognition system must make speech samples from a number of speakers available. The re-adaptation of the code book then proceeds by supervised learning, i.e., the system must be informed of the correct transcription of each utterance. The complicated work steps that this requires cannot be expected of a customer.
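As a hedged illustration of the MAP idea referred to above, the sketch below shows only the classical Gaussian mean update (prior mean interpolated with the adaptation data, weighted by the amount of data against a relevance factor). The function name, the relevance factor tau, and the restriction to means are assumptions; the cited work covers the full HMM parameter estimation.

```python
# Hedged sketch of a MAP-style mean update: not the complete algorithm of the
# cited reference, only the commonly used Gaussian mean interpolation.
import numpy as np

def map_adapt_mean(prior_mean: np.ndarray,
                   frames: np.ndarray,       # shape (T, D): adaptation feature vectors
                   posteriors: np.ndarray,   # shape (T,): occupation probabilities gamma_t
                   tau: float = 10.0) -> np.ndarray:
    """Interpolate the prior mean with the data mean, weighted by the amount
    of adaptation data (sum of posteriors) against the prior weight tau."""
    occ = posteriors.sum()                          # effective frame count for this density
    data_sum = (posteriors[:, None] * frames).sum(axis=0)
    return (tau * prior_mean + data_sum) / (tau + occ)

# Usage: adapt one code-book mean with a handful of supervised adaptation frames.
mu0 = np.zeros(3)
x = np.random.randn(20, 3) + 1.0
gamma = np.ones(20)
mu_map = map_adapt_mean(mu0, x, gamma, tau=10.0)
```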
Both prior-art solutions have the common disadvantage that they run only off-line. For an HMM code book adaptation, the running system must therefore be shut down so that the new parameters, i.e. the corresponding recognition units, can be loaded into the system. Further, the procedures of training and adaptation take a long time to set up and carry out, which means a financial disadvantage for the purchaser. An initial code book for the HMMs is therefore often supplied when the product is delivered. Two training strategies for this are known from the prior art.
On the one hand, the code book can be generated on the basis of a phonetically balanced training dataset. Such code books offer the advantage that they can handle all conceivable applications with unknown tasks, since they do not prioritize any recognition units. On the other hand, the code book can be trained for the specific application: the speech recognition system is then trained on exactly the vocabulary that plays a part in the ultimate application. A higher recognition rate for that application is achieved mainly because the speech recognition system can make use of co-articulations that it already encountered in the training phase. However, such specialized code books perform worse for applications in which the lexicon changes.
When the lexicon, and thus the vocabulary, of the ultimate application can be modified or is even entirely unknown at training time, manufacturers must laboriously incorporate a code book prepared as generally as possible into their speech recognition systems.
D. B. Paul et al., “The Lincoln Large-Vocabulary Stack-Decoder HMM CSR”, Vol. 2 of 5, Apr. 27, 1993, IEEE, also discloses that a speech recognition system can be adapted to a new speaker in real time. Since, however, the vocabulary in this known system is limited and fixed, the Paul et al. article does not indicate how a modification of the vocabulary could be implemented with such a method.
A further significant problem is that new acoustic-phonetic models must be trained for every language in which the speech recognition technology is to be introduced, in order to adapt it to each national language. Speech recognition systems usually employ HMMs for modelling the language-specific sounds. From these statistically modelled sound models, acoustic word models are then compiled that are recognized during the search process of the speech recognition procedure. Very extensive speech databases are required for training these sound models, and their collection and editing represent an extremely cost-intensive and time-consuming process. Disadvantages therefore arise when transferring a speech recognition technology from one language to another, since the production of a new speech database, on the one hand, makes the product more expensive and, on the other hand, delays the market introduction.
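The following sketch illustrates how acoustic word models can be compiled from phone-level sound models via the lexicon, as described above. The three-states-per-phone layout and all names are illustrative assumptions.

```python
# Illustrative sketch: compile a left-to-right word model from phone HMM states
# using the lexicon's phonetic transcription (three states per phone assumed).
from typing import Dict, List

def compile_word_model(word: str,
                       lexicon: Dict[str, List[str]],
                       states_per_phone: int = 3) -> List[str]:
    """Concatenate the HMM states of each phoneme in the word's transcription
    into one state sequence representing the acoustic word model."""
    states: List[str] = []
    for phone in lexicon[word]:
        states.extend(f"{phone}_{i}" for i in range(states_per_phone))
    return states

lexicon = {"MILLER": ["m", "I", "l", "3r"]}
print(compile_word_model("MILLER", lexicon))
# ['m_0', 'm_1', 'm_2', 'I_0', 'I_1', 'I_2', 'l_0', 'l_1', 'l_2', '3r_0', '3r_1', '3r_2']
```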
Commercially available speech recognition systems employ exclusively language-specific models. To transfer these systems to a new language, extensive speech databases are collected and edited. Subsequently, the sound models for the new language are retrained from scratch with this collected speech data.
In order to reduce the outlay and the time delay when transferring speech recognition systems to different languages, it should thus be examined whether individual sound models are suitable for employment in different languages. P. Dalsgaard and O. Anderson, “Identification of Mono- and Poly-phonemes using acoustic-phonetic Features derived by a self-organising Neural Network”, in Proc. ICSLP '92, pages 547-550, Banff, 1992, already provides approaches for producing multilingual sound models and utilizing them for speech recognition in the respective languages. The terms polyphoneme and monophoneme are also introduced therein: polyphonemes are defined as sounds whose sound formation properties are similar enough across several languages to be equated, whereas monophonemes denote sounds that exhibit language-specific sound formation properties.
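As a hedged illustration of the polyphoneme/monophoneme distinction drawn above, the sketch below equates two language-specific sound models when a model distance falls below a threshold. The diagonal-Gaussian Bhattacharyya distance and the threshold value are assumptions made here for illustration; the cited work instead derives acoustic-phonetic features with a self-organising neural network.

```python
# Hedged sketch: decide whether two language-specific sound models are similar
# enough to be equated (polyphoneme candidate) or remain language-specific
# (monophoneme). Distance measure and threshold are illustrative assumptions.
import numpy as np

def bhattacharyya_diag(mu1, var1, mu2, var2) -> float:
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    avg_var = 0.5 * (var1 + var2)
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / avg_var)
    term2 = 0.5 * np.sum(np.log(avg_var / np.sqrt(var1 * var2)))
    return term1 + term2

def is_polyphoneme(model_a, model_b, threshold: float = 1.0) -> bool:
    """Equate the two sounds if their models are close enough."""
    return bhattacharyya_diag(*model_a, *model_b) < threshold

# e.g. German /m/ vs. English /m/, each as a (mean, variance) pair of one density
german_m = ([0.2, -0.4, 0.6], [1.0, 0.9, 1.1])
english_m = ([0.25, -0.35, 0.55], [1.1, 1.0, 1.0])
print(is_polyphoneme(german_m, english_m))   # True -> candidate polyphoneme
```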
Bub Udo
Höge Harald
Köhler Joachim
Bell Boyd & Lloyd LLC
Knepper David D.
Siemens Aktiengesellschaft