Reexamination Certificate
2000-04-28
2002-12-31
Dorvil, Richemond (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
For storage or transmission
C704S220000, C704S201000
active
06502070
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to the field of speech recognition and more particularly to a method and apparatus for normalizing channel-specific feature elements in a signal derived from a spoken utterance. In general, the method aims to compensate for changes induced in the feature elements of the signal as a result of the transmission of the signal through a certain communication channel.
BACKGROUND OF THE INVENTION
In a typical speech recognition application, the user inputs a spoken utterance into an input device such as a microphone or telephone set. If valid speech is detected, the speech recognition layer is invoked in an attempt to recognize the unknown utterance. In a commonly used approach, the input speech signal is first pre-processed to derive a sequence of speech feature elements characterizing the input speech in terms of certain parameters. The sequence of speech feature elements is then processed by a recognition engine to derive the entry from a speech recognition dictionary that most likely matches the input speech signal. Typically, the entries in the speech recognition dictionary are made up of symbols, each symbol being associated with a speech model.
Prior to the use of a speech recognition system, the entries in the speech recognition dictionary as well as the speech models are trained to establish a reference memory and a reference speech model set. For speaker-independent systems, training is performed by collecting samples from a large pool of users. Typically, a speaker-independent system uses a single speech model set for all speakers, while a speaker-specific system assigns each user a respective speech model set. Speaker-specific systems are trained by collecting speech samples from the end user. For example, a voice dictation system, in which a user speaks and the device converts the spoken words into text, will most likely be trained by the end user (speaker-specific), since this manner of training can achieve higher recognition accuracy. In the event that someone other than the original user wants to use the same device, the device can be retrained, or an additional set of speech models can be trained and stored for the new user. As the number of users becomes large, storing a separate speaker-specific speech model set for each user becomes prohibitive in terms of memory requirements. Therefore, as the number of users becomes large, speech recognition systems tend to be speaker-independent.
In addition to interacting with different users, it is common for a speech recognition system to receive, over different communication channels, the signal containing the spoken utterance on which the speech recognition process is to be performed. In a specific example, a speech recognition system operating in a telephone network may process speech signals originating from a wireline communication channel or a wireless communication channel, among others. Generally, such speech recognition systems use a common speech model set across the different communication channels. However, the variability between communication channels results in variability in the acoustic characteristics of the speech signals. Consequently, the recognition performance of a speech recognition system is adversely affected, since the speech models in the common speech model set do not reflect the acoustic properties of the speech signal, in particular the changes induced in the signal by the channel used for transporting the signal. Since different channels can be used to transport the signal toward the speech recognition apparatus, and each channel induces different changes in the signal, it is difficult to adapt the speech models so as to accurately compensate for such channel-specific distortions introduced in the feature elements.
A commonly used technique to overcome this problem is exhaustive modeling. In exhaustive modeling, each communication channel that the speech recognition system is adapted to support is associated with a channel-specific speech model set. For each channel, a plurality of speech samples is collected from the end users in order to train a channel-specific speech model set.
A deficiency in exhaustive modeling is that it requires a large amount of training data for each communication channel. This represents a significant commissioning expense as the number of environmental and channel conditions increases.
Another common approach to improving the performance of speech recognition systems is adaptation: adjusting either speech models or features in a manner appropriate to the current channel and environment. A typical adaptation technique is model adaptation. Generally, model adaptation starts with reference speech models derived from one or more spoken utterances over a reference communication channel (say, a wireline communication channel) and then, based on a small amount of speech from a new communication channel (say, a wireless communication channel), new channel-specific models are iteratively generated. For a more detailed explanation of model adaptation, the reader is invited to consult R. Schwartz and F. Kubala, Hidden Markov Models and Speaker Adaptation, Speech Recognition and Understanding: Recent Advances, Eds: P. Laface and R. De Mori, Springer-Verlag, 1992; L. Neumeyer, A. Sankar and V. Digalakis, A Comparative Study of Speaker Adaptation Techniques, Proc. of EuroSpeech '95, pp. 1127-1130, 1995; J.-L. Gauvain, C.-H. Lee, Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, IEEE Trans. on Speech and Audio Processing, Vol. 2, April 1994, pp. 291-298; and C. J. Leggetter, P. C. Woodland, Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models, Computer Speech and Language, Vol. 9, 1995, pp. 171-185. The content of these documents is hereby incorporated by reference.
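By way of illustration only, the general idea of model adaptation described above can be sketched for the simplest case: a single Gaussian mean updated in the maximum a posteriori style of the Gauvain and Lee reference. The function name map_adapt_mean, the prior weight tau and the sample values are assumptions made for this sketch and do not appear in the cited documents; practical systems adapt full mixture models using soft frame-to-component assignments.

```python
import numpy as np

def map_adapt_mean(reference_mean, new_channel_frames, tau=10.0):
    """MAP-style update of one Gaussian mean (illustrative simplification).

    reference_mean: mean trained on the reference channel (e.g. wireline).
    new_channel_frames: feature vectors observed on the new channel,
        all assumed to belong to this one Gaussian (hard assignment).
    tau: hypothetical prior weight; larger values trust the reference more.
    """
    prior = np.asarray(reference_mean, dtype=float)
    frames = np.asarray(new_channel_frames, dtype=float)
    n = frames.shape[0]
    # Weighted combination of the prior mean and the new-channel data.
    return (tau * prior + frames.sum(axis=0)) / (tau + n)

# Made-up numbers: two frames from the new channel pull the mean
# part of the way from the reference value toward the observed data.
reference_mean = np.array([0.0, 0.0])
wireless_frames = np.array([[1.0, 2.0], [1.0, 2.0]])
adapted = map_adapt_mean(reference_mean, wireless_frames, tau=2.0)
```

With only a few frames the adapted mean stays close to the reference mean, and it moves toward the new-channel data only as more frames are supplied.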
A deficiency in the above-described methods is that, in order to obtain channel-specific models providing reasonable performance, they require a relatively large amount of data, which may not be readily available.
Consequently, there is a need in the industry for providing a method and apparatus to compensate for channel-specific changes induced in the feature elements of a signal derived from a spoken utterance, the signal being intended for processing by a speech recognition apparatus.
SUMMARY OF THE INVENTION
In accordance with a broad aspect, the invention provides an apparatus for normalizing speech feature elements in a signal derived from a spoken utterance. The apparatus has an input for receiving the speech feature elements which are transmitted over a certain channel. The certain channel is a path or link over which data passes between two devices and is characterized by a channel type that belongs to a group of N possible channel types. Non-limiting examples of possible channel types include a hand-held channel (a path or link established by using a hand-held telephone set), a wireless channel (a path or link established by using a wireless telephone set) and a hands-free channel (a path or link established by using a hands-free telephone set) among a number of other possible channel types.
The apparatus includes a processing unit coupled to the input. The processing unit alters or skews the speech feature elements to simulate a transmission over a reference channel that is other than the channel over which the transmission actually takes place.
The signal output by the apparatus is suitable for processing by a speech recognition apparatus.
One of the benefits of this invention is increased speech recognition accuracy, obtained by reducing the variability that the particular transmission channel introduces into the speech signal on which the recognition is performed.
The reference channel can correspond to a real channel (for example, the hands-free channel) or it can be a virtual channel. Such a virtual channel does not physically exist. It is artificially defined by certain transmission characteristics that are arbitrarily chosen.
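By way of illustration only, the normalization described above can be sketched as a lookup of a per-channel-type transformation that is applied to each feature vector so that the result resembles a transmission over the reference channel. The affine form of the transformation, the table CHANNEL_TRANSFORMS, the function normalize_features and all numeric values are assumptions made for this sketch; the text above does not specify how the alteration is computed.

```python
import numpy as np

# Hypothetical transformation table: one (matrix, offset) pair per channel
# type. "hands-free" plays the role of the reference channel here, so its
# transformation is the identity. All values are made up for illustration.
CHANNEL_TRANSFORMS = {
    "hands-free": (np.eye(2), np.zeros(2)),
    "hand-held": (np.eye(2), np.array([0.5, 0.0])),
    "wireless": (np.array([[0.5, 0.0], [0.0, 2.0]]), np.array([1.0, -1.0])),
}

def normalize_features(features, channel_type):
    """Map feature vectors from the given channel type toward the reference.

    features: array of shape (num_frames, feature_dim).
    channel_type: one of the N channel types in CHANNEL_TRANSFORMS.
    """
    matrix, offset = CHANNEL_TRANSFORMS[channel_type]
    frames = np.asarray(features, dtype=float)
    # Apply y = A x + b to every frame.
    return frames @ matrix.T + offset

wireless_features = np.array([[2.0, 3.0]])
normalized = normalize_features(wireless_features, "wireless")
```

In this sketch, features arriving over the reference channel pass through unchanged, while features from the other channel types are shifted or scaled before being handed to the speech recognition apparatus.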
In a specific non-limiting e
Boies Daniel
Dumoulin Benoit
Peters Stephen Douglas
Dorvil Richemond
Nolan Daniel
Nortel Networks Limited
Smith Kevin L.