Reexamination Certificate
2000-10-24
2004-02-10
McFadden, Susan (Department: 2655)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S256000, C704S205000
active
06691090
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to speech recognition. More particularly, the present invention relates to speech recognition in a radio communication system and/or in a Distributed Speech Recognition (DSR) system.
The main objective of speech recognition is to provide quick and easy access to a wide variety of computer services and communication systems by using human speech. Speech recognition applications range from simple voice control using a limited number of basic command words, like “yes” or “no”, or numbers from zero to nine, to much more flexible systems capable of turning spontaneous speech into written text, i.e. dictation systems. In dictation-like applications the vocabulary is typically very extensive, containing tens of thousands of words, and thus in such systems, which are known as Large Vocabulary Continuous Speech Recognition (LVCSR) systems, computational complexity and memory requirements are very high.
A general speech recognition system can roughly be divided into two main parts. First, the most important characteristics of the speech signal are captured in a pre-processing stage called feature extraction, and this part of the speech recognition system is called the front-end (FE). The front-end converts a sampled speech waveform into a representation more suitable for recognition purposes. The extracted parameters, known as feature vectors, are then fed into the recogniser or back-end (BE), which performs the actual probability estimation and classification, that is to say, the back-end carries out the recognition and outputs the result. The more complex the recognition task, the more important it is to have good quality feature vectors. Variations in speech, owing for instance to speakers having different dialects or talking at different speeds, are factors which affect a speech recognition system. Environmental noise and distortion are further factors which deteriorate the quality of feature vectors and, in turn, influence the performance of the speech recognition system as a whole. Although the FE can provide some robustness against these factors, the quality of the speech fed to the FE is critical.
Speech recognition technology is growing in its application in mobile telecommunications. Cellular phones that are able to make a call by simply listening to the phone number or the name of the person the user wants to talk to are already available. However, more complex tasks, such as dictation, are still very difficult to implement in a mobile environment. Since it is crucial to provide the recogniser with as good quality speech as possible, it would seem logical to try to place the recogniser as close to the user as possible, i.e., directly in the telephone handset. However, the computational load and memory demands of LVCSR do not make this a viable approach.
To address these problems, it has been proposed to place the BE at a central place in the cellular network, whilst the FE part, with its comparatively low computational demands, can be located in the telephone handset. In this way it is possible to take advantage of high performance computers in the cellular network which can be shared by many users at a time. This type of arrangement of a speech recognition system over the network is referred to as Distributed Speech Recognition (DSR). In DSR, it is proposed that the speech signal is transformed into feature vectors locally at the handset and these are transmitted as digital data over the transmission channel relatively free of errors. When feature vectors are extracted at the handset, the BE can operate on the data stream, or sequence of feature vectors which usually represent high quality speech, and can therefore achieve good recognition performance.
A commonly used approach for carrying out feature extraction is the cepstral approach, in which the extracted feature vectors are called mel-frequency cepstral coefficients (MFCCs). The cepstral approach is motivated by the nature of the speech signal itself, and in particular by the distortions it undergoes during the first stages of its acquisition and processing. It is widely accepted that the speech signal is contaminated by a number of convolutional noise sources, i.e. in the generation and acquisition of the speech signal, a number of factors alter the speech in such a way that the disturbance can be modelled as a mathematical convolution between the speech signal and each of the disturbing factors.
The first of these arises due to the physiological processes involved in the formation of human speech. The driving force of the speech formation process is air expelled by the lungs. It is argued that because the human respiratory tract, including the lungs themselves, the trachea, and the pharyngeal, oral and nasal cavities, has a certain geometry, it has a natural frequency response, or acoustic transfer function. This can be thought of in the same terms as the transfer function of an electronic circuit. Just as the transfer function of an electronic circuit becomes convolved with an electrical signal that is applied to the circuit, so the periodic vibrations of the vocal cords, which form the speech signal, undergo a convolution with the acoustic transfer function of the human respiratory tract. In other words, the geometry of the respiratory tract can be thought of as giving rise to a convolutional ‘noise’ source that distorts the speech signal. Furthermore, when the speech signal is detected, for example using a microphone, and transferred to some input circuitry for amplification, the transfer functions of the microphone, the transmission line and the amplifier circuitry also become convolved with the speech signal. There are also likely to be a number of additive noise sources, for example background or environmental noise detected by the microphone along with the speech signal.
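The chain of convolutions described above can be illustrated with a minimal sketch. The pulse train, the decaying "tract" response and the two-tap "microphone" response are hypothetical stand-ins chosen for illustration only; they are not taken from the patent and are not realistic acoustic models.

```python
def convolve(signal, impulse_response):
    """Direct (time-domain) linear convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

# Hypothetical excitation: a periodic pulse train standing in for the
# periodic vibrations of the vocal cords.
excitation = [1.0 if n % 8 == 0 else 0.0 for n in range(32)]

# Hypothetical decaying impulse response standing in for the acoustic
# transfer function of the respiratory tract.
tract = [0.5 ** k for k in range(4)]

# Hypothetical two-tap response standing in for microphone and amplifier.
mic = [0.9, 0.1]

# Each stage convolves its own response with the signal it receives.
speech = convolve(excitation, tract)
observed = convolve(speech, mic)
```

Because convolution is associative, the same `observed` sequence results from first combining `tract` and `mic` into a single response and then applying it to `excitation`, which is why the separate distortions are hard to tell apart in the time domain.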
Therefore, when processing a speech signal, the problem of minimising the effect of the convolutional and additive noise must be addressed. Electronic filters can be designed to reduce the effect of additive background noise, although this in itself may be complicated, as the nature of the background noise may vary significantly from location to location and also as a function of time. However, filtering cannot be used to reduce the effect of convolutional noise, and the analysis of convolved signals in the time domain is by its very nature very complicated.
It is known that a convolution operation in the time domain can be transformed into a multiplication operation in the frequency domain by applying a Fourier transform to the time domain signal. This is a standard approach used in a wide variety of digital signal processing (DSP) applications, for example to analyse the transfer functions of filters. Typically, in DSP applications, the Fourier transform is computed using a Fast Fourier Transform (FFT) algorithm, which is computationally far more efficient than direct evaluation of the Discrete Fourier Transform (DFT).
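The convolution-to-multiplication property can be checked numerically. The following is a minimal sketch, assuming short power-of-two length sequences and circular convolution (the form for which the discrete transform property holds exactly; linear convolution needs zero-padding first). The test sequences `x` and `h` are arbitrary illustrative values, and the radix-2 `fft` here is a textbook recursive version, not an optimised implementation.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of the discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft(x):
    """Recursive radix-2 FFT, O(N log N); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out

def circular_convolve(x, h):
    """Circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]

# Arbitrary illustrative sequences (assumed values, length 8).
x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 3.0]
h = [1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]

# Transform of the convolution versus product of the transforms.
lhs = fft(circular_convolve(x, h))
rhs = [a * b for a, b in zip(fft(x), fft(h))]
max_err = max(abs(a - b) for a, b in zip(lhs, rhs))

# The FFT computes the same values as the direct DFT, only faster.
fft_vs_dft = max(abs(a - b) for a, b in zip(fft(x), dft(x)))
```

Both `max_err` and `fft_vs_dft` should differ from zero only by floating-point rounding.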
Performing an FFT is also the first step in forming a cepstral representation of a time domain signal. In transforming the speech signal into the frequency domain using a Fourier transform, convolutional effects, such as the distortion in the speech signal due to the acoustic properties of the human respiratory tract, are converted into multiplicative factors. The next step in calculating a cepstral representation of a speech signal is to take the logarithm of the Fourier transformed speech signal, which turns those multiplicative factors into additive ones. A further Fourier transform is then performed to produce the cepstrum. In speech processing applications, a Discrete Cosine Transform (DCT) is often used instead of an FFT at this stage, because it offers a further increase in computational efficiency. In the cepstrum, all of the effects of time-domain convolutions are reduced to additive terms and it can be shown theoretically and experimentally that this kind of representation of the speech signal provides a much more reliable representation than convention
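The three steps described above (Fourier transform, logarithm, second transform) can be sketched as follows. This is a simplified illustration under stated assumptions: it computes a plain cepstrum from the log magnitude spectrum using a direct DFT and an unnormalised DCT-II, it omits the mel filter bank used for MFCCs, and the sequences `source` and `channel` are hypothetical values chosen so that no spectral bin is zero.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform (an FFT would give the same values)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def dct2(v):
    """Unnormalised DCT-II of a real sequence."""
    n = len(v)
    return [sum(v[t] * math.cos(math.pi * k * (2 * t + 1) / (2 * n))
                for t in range(n))
            for k in range(n)]

def cepstrum(x):
    """Plain cepstrum: DFT, then log magnitude, then DCT."""
    log_magnitude = [math.log(abs(c)) for c in dft(x)]
    return dct2(log_magnitude)

def circular_convolve(x, h):
    """Circular convolution of two equal-length sequences."""
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]

# Hypothetical 'speech' and 'channel' sequences (assumed values).
source = [1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0]
channel = [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

distorted = circular_convolve(source, channel)

# Convolution in time becomes addition in the cepstral domain.
c_sum = [a + b for a, b in zip(cepstrum(source), cepstrum(channel))]
c_distorted = cepstrum(distorted)
max_diff = max(abs(a - b) for a, b in zip(c_sum, c_distorted))
```

Here `max_diff` should be zero up to rounding, which shows the channel contribution appearing as a separate additive term in the cepstrum of the distorted signal.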
Laurila Kari
Tian Jilei
Antonelli Terry Stout & Kraus LLP
McFadden Susan
Nokia Mobile Phones Limited