Speech recognition from overlapping frequency bands with...

Data processing: speech signal processing – linguistics – language – Speech signal processing – For storage or transmission


Details

U.S. classes: C704S231000, C704S236000, C704S251000
Type: Reexamination Certificate
Status: active
Patent number: 06721698

ABSTRACT:

FIELD OF THE INVENTION
This invention relates to speech recognition.
BACKGROUND OF THE INVENTION
Speech recognition is well known in the field of computers. Nowadays it is being applied in mobile telephones, particularly to enable voice dialling. With voice dialling, a user can, for example, say the name of a person whom he or she wants to call; the telephone recognises the name and then looks up the corresponding number. Alternatively, the user may say the required telephone number directly. This is convenient, since the user does not have to use the keys. It is desirable to extend the ability of mobile telephones to understand spoken words, letters, numerals and other spoken information. Unfortunately, current speech recognition techniques require too much processing capacity to be practical in a small portable mobile telephone.
Speech recognition functionality can be implemented in a telephone network in such a way that a telephone user's speech is recognised in the network rather than in the handset. By locating speech recognition functionality in the network, greater processing power can be made available. However, the accuracy of speech recognition is degraded by distortions introduced into the speech signal and by the reduction in bandwidth that results from its transmission to the network. In a typical landline connection, the bandwidth of the speech signal transferred to the network is only about 3 kHz, which means that a significant part of the voice spectrum is lost and the information it contains is thus unavailable for use in speech recognition. This problem can be avoided by dividing speech recognition functionality between the telephone handset and the network.
WO 95/17746 describes a system in which an initial stage of speech recognition is carried out in a remote station. The remote station generates parameters characteristic of the voice signal, so-called "speech features", and transmits them to a central processing station which is provided with the functionality to process the features further. In this way, the features can be extracted from the speech signal using, for example, the entire spectrum captured by the microphone of the remote station. The required transmission bandwidth between the remote station and the central processing station is also reduced: instead of transmitting a speech signal to convey the speech in electrical form, only a limited number (e.g. tens) of parameters (features) are transmitted for each speech frame.
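A rough calculation illustrates the scale of the saving. The frame rate, feature count and quantisation used below are illustrative assumptions, not figures taken from WO 95/17746:

```python
# Illustrative bandwidth comparison for distributed speech recognition.
# All figures are assumptions for the sake of the example.

sample_rate_hz = 8_000          # narrowband telephone sampling rate
bits_per_sample = 8             # e.g. 8-bit companded PCM
waveform_kbps = sample_rate_hz * bits_per_sample / 1000
print(f"raw waveform:   {waveform_kbps:.1f} kbit/s")   # 64.0 kbit/s

frames_per_second = 100         # one feature vector every 10 ms
features_per_frame = 14         # "tens of parameters" per frame
bits_per_feature = 8            # coarse scalar quantisation
feature_kbps = frames_per_second * features_per_frame * bits_per_feature / 1000
print(f"feature stream: {feature_kbps:.1f} kbit/s")    # 11.2 kbit/s
```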
The two main blocks typically present in speech recognition systems are a signal processing front-end, where feature extraction is performed, and a back-end, where pattern matching is performed to recognise the spoken information. It is worth mentioning that the division of speech recognition into these two parts, front-end and back-end, is also feasible in cases other than a distributed speech recognition system. The task of the signal processing front-end is to convert a real-time speech signal into a parametric representation in such a way that the most important information is extracted from the speech signal. The back-end is typically based on a Hidden Markov Model (HMM) that adapts to a speaker so that the probable words or phonemes are recognised from a set of parameters corresponding to distinct states of speech. The speech features provide these parameters. The objective is that the extracted feature vectors are robust to distortions caused by background noise, a communications channel, or audio equipment (for example, that used to capture the speech signal).
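As a toy sketch of what such a back-end does, the following Viterbi decoder finds the most likely HMM state sequence for a sequence of feature vectors under single-Gaussian emission densities. The model shapes and parameters are invented for illustration; a real recogniser uses trained models and the feature vectors produced by the front-end:

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log-likelihood of feature vector x under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def viterbi(features, log_trans, means, variances, log_prior):
    """Most likely HMM state sequence for a sequence of feature vectors."""
    T, S = len(features), len(means)
    delta = np.full((T, S), -np.inf)    # best log-score ending in state s
    back = np.zeros((T, S), dtype=int)  # backpointers
    for s in range(S):
        delta[0, s] = log_prior[s] + log_gauss(features[0], means[s], variances[s])
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] + log_gauss(features[t], means[s], variances[s])
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Toy usage: 3 states, 2-dimensional features, 5 frames (random data).
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 2))
means = rng.normal(size=(3, 2))
varis = np.ones((3, 2))
log_A = np.log(np.full((3, 3), 1.0 / 3.0))
print(viterbi(feats, log_A, means, varis, np.log(np.ones(3) / 3)))
```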
Prior art systems often derive speech features using a front-end algorithm based on Mel-Frequency Cepstral Coefficients (MFCCs). MFCCs provide good accuracy in situations where there is little or no background noise, but their performance drops significantly even at moderate noise levels. Thus, there is a need for a method that matches this performance at low levels of background noise and performs significantly better in noisier conditions.
The noise which disturbs the speech recognition process originates from various sources. Many of these are so-called convolutional noise sources; in other words, their effect on the speech signal can be represented as a mathematical convolution between the noise source and the speech signal. The vocal tract of the user and the electrical components used in speech acquisition and processing can both be considered convolutional noise sources. The user's vocal tract has an acoustic transfer function determined by its physical configuration, and the electrical components of the acquisition and processing system have certain electronic transfer functions. The transfer function of the user's vocal tract affects, among other things, the pitch of the spoken information uttered by the user, as well as its general frequency characteristics. The transfer functions of the electrical components, which usually include a microphone, one or more amplifiers and an Analogue-to-Digital (A/D) converter for converting the signal captured by the microphone into digital form, affect the frequency content of the captured speech information. Thus, both the user-specific transfer function of the vocal tract and the device-specific electronic transfer function(s) effectively cause inter-user and inter-device variability in the properties of the speech information acquired for speech recognition. The provision of a speech recognition system that is substantially immune to these kinds of variations is a demanding technical task.
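The standard textbook relations below, not spelled out in the passage above, show why such distortion is called convolutional and why it is often handled in the log-spectral or cepstral domain, where it becomes a simple additive bias:

```latex
% Observed signal y: clean speech s convolved with the combined
% impulse response h of the vocal tract and acquisition chain.
y[n] = (h * s)[n] = \sum_{k} h[k]\, s[n-k]
% Convolution in time is multiplication in frequency:
Y(f) = H(f)\, S(f)
% Taking the logarithm of the magnitudes makes the distortion additive:
\log\lvert Y(f)\rvert = \log\lvert H(f)\rvert + \log\lvert S(f)\rvert
```

Because a slowly varying H(f) appears as a near-constant additive term in the log domain, techniques such as cepstral mean subtraction can largely remove it.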
Speech recognition of a captured speech signal typically begins with A/D conversion, pre-emphasis and segmentation of the time-domain electrical speech signal. At the pre-emphasis stage, the amplitude of the speech signal is enhanced in certain frequency ranges, usually those in which the amplitude is smaller. Segmentation divides the signal into frames, each representing a short time period, usually 20 to 30 milliseconds. The frames are formed in such a way that they are either temporally overlapping or non-overlapping. Speech features are generated using these frames, often in the form of Mel-Frequency Cepstral Coefficients (MFCCs). It should be noted that although much of the description which follows concentrates on the use of Mel-Frequency Cepstral Coefficients in the derivation of speech features, application of the invention is not limited to systems in which MFCCs are used; other parameters may also be used as speech features. WO 94/22132 describes the generation of MFCCs. The operation of the MFCC generator described in that publication is shown in FIG. 1. A segmented speech signal is received by a time-to-frequency-domain conversion unit. In step 101, a speech frame is transformed into the frequency domain with a Fast Fourier Transform (FFT) algorithm to provide 256 transform coefficients. In step 102, a power spectrum of 128 coefficients is formed from the transform coefficients. In step 103, the power spectrum is integrated over 19 frequency bands to provide 19 band power coefficients. In step 104, a logarithm is computed from each of the 19 band power coefficients to provide 19 log-values. In step 105, a Discrete Cosine Transform (DCT) is performed on the 19 log-values. The frequency-domain signal is then processed in a noise reduction block in order to suppress noise in the signal. Finally, the 8 lowest-order coefficients are selected.
It should be appreciated that the numbers of samples and the various coefficient counts referred to in WO 94/22132 represent only one example.
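Since these numbers are only one example, the following Python/NumPy sketch fills them in with common textbook choices. Only the step structure (pre-emphasis, windowing, a 256-point FFT, a 128-bin power spectrum, 19 band energies, logarithms and a DCT with 8 retained coefficients) follows the description above; the Hamming window, the pre-emphasis coefficient of 0.97 and the mel-spaced triangular filters are conventional assumptions, not details from WO 94/22132:

```python
import numpy as np

def mfcc_frame(frame, sample_rate=8000, n_bands=19, n_keep=8):
    """One speech frame through a FIG. 1 style pipeline (illustrative only)."""
    # Pre-emphasis: boost the weaker high-frequency content
    # (coefficient 0.97 is a conventional choice, not from the patent).
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])

    # Step 101: 256-point FFT of the windowed frame (zero-padded if short).
    windowed = emphasized * np.hamming(len(emphasized))
    spectrum = np.fft.rfft(windowed, n=256)

    # Step 102: power spectrum; the first 128 bins are used here.
    power = (np.abs(spectrum) ** 2)[:128]

    # Step 103: integrate the power spectrum over 19 frequency bands
    # using triangular mel-spaced filters (an assumed band layout).
    fbank = mel_filterbank(n_bands, n_fft_bins=128, sample_rate=sample_rate)
    band_power = fbank @ power

    # Step 104: logarithm of each band power coefficient.
    log_bands = np.log(np.maximum(band_power, 1e-10))

    # Step 105: DCT-II of the 19 log-values; keep the 8 lowest-order
    # coefficients as the feature vector.
    n = np.arange(n_bands)
    dct_basis = np.cos(np.pi * np.outer(n, 2 * n + 1) / (2 * n_bands))
    cepstrum = dct_basis @ log_bands
    return cepstrum[:n_keep]

def mel_filterbank(n_bands, n_fft_bins, sample_rate):
    """Triangular filters equally spaced on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2), n_bands + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.arange(n_fft_bins) * sample_rate / 256.0
    fbank = np.zeros((n_bands, n_fft_bins))
    for b in range(n_bands):
        lo, mid, hi = hz_edges[b], hz_edges[b + 1], hz_edges[b + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fbank[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank
```

A full front-end would apply this to successive 20 to 30 ms frames and typically append energy and time-derivative features to each vector.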
It is a characteristic of linear transforms, for example DCTs, that a disturbance caused by noise in a certain frequency band spreads to surrounding frequency bands. This is an undesirable effect, particularly in speech recognition applications.
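A small numerical check, reusing the DCT construction from the sketch above with invented values, makes the effect visible: a disturbance confined to a single log band energy changes every transform coefficient, and hence every derived feature:

```python
import numpy as np

n_bands = 19
n = np.arange(n_bands)
dct_basis = np.cos(np.pi * np.outer(n, 2 * n + 1) / (2 * n_bands))

log_bands = np.zeros(n_bands)   # idealised clean log band energies
noisy = log_bands.copy()
noisy[7] += 1.0                 # disturbance confined to one band

delta = dct_basis @ noisy - dct_basis @ log_bands
print(np.count_nonzero(np.abs(delta) > 1e-12))  # 19: every coefficient moved
```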
In Okawa et al., "Multiband Speech Recognition in Noisy Environments", IEEE, 1998, pp. 641-644 (IEEE 0-7803-4428-6/98), a multi-band automatic speech recognition method is presented. In this method a
