Reexamination Certificate
2000-04-27
2001-10-09
Dorvil, Richemond (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S256000, C704S241000, C704S236000, C704S251000, C704S231000
active
06301562
ABSTRACT:
BACKGROUND
The invention relates to a speech recognition method and a system for carrying out the method.
A number of prior art speech recognition systems are known. Most commercial approaches use a hidden Markov model (HMM). In this model, short intervals of speech are processed using a probabilistic model of the likelihood of any given word or sub-word producing a given output. The short intervals of speech may overlap, and may be parameterised by spectral parameters, for example output from a filter bank, a discrete Fourier transform, or even the parameters from a linear predictive coding analysis of the input speech. The best match of the input speech to the model is then determined. The values of probability used in the model are generated using a training phase. This approach, being standard in the art, is conventional and will not be further described.
Many commercial packages use this approach together with a linguistic engine that uses information about the language spoken to cut down the likely possibilities. This approach has led to several packages achieving hit rates of about 97%. There is, however, a need to increase this figure.
An approach known as time encoded speech (TES), or TESPAR, has been described in GB 2020517, GB 2084433, GB 2162024, GB 2162025, GB 2187586, GB 2179183, WO 92/15089, WO 97/31368, WO 97/45831 and WO 98/08188, which are hereby incorporated by reference in their entirety. In this approach, speech is coded into a small number of symbols. Speech recognition systems using speech encoded in this way have been proposed, inter alia, in WO 97/45831 and GB 2187586. However, the approach does not appear to have been widely implemented; it is believed that high recognition rates have not been achieved with the system.
According to the invention there is provided a speech recognition method including
inputting speech to be recognised,
encoding the input speech using time encoding,
using a hidden Markov model to determine scores indicating how the input speech matches some or all of a plurality of speech elements,
determining which, if any, speech element best corresponds to the input speech using the time encoded speech and the Markov scores, and
outputting the speech element, if any, so determined.
The speech waveform may be characterised by fluctuations in pressure about a mean value, which will be considered the “zero” value for the purposes of time encoding, described below. The input function is therefore a single valued function that oscillates about a zero value with a finite range of frequencies. Such a band-limited function is ideally suited to TESPAR analysis.
Once the input device has recorded the speech waveform, some form of pre-processing is usually in order. This may include filtering the signal to remove frequencies outside the bandwidth covered by speech. For frequency analysis using the HMM method, the signal is then divided into short time segments (say 10 ms).
TESPAR can be used with a signal that is broken up into segments of any length. The signal can therefore be divided into short time segments in a manner similar to that used in the HMM. Alternatively, the signal can be divided into separate words, phrases or even sentences. TESPAR can also be used directly to divide up the signal according to some criterion, for example to find the end points of an utterance. One way to achieve this is to take short time segments and encode each segment into an ‘S’ matrix. Summing the matrix elements for each time segment gives a vector of numbers indicating how much sound is present in each segment. This vector can then be used to find the transitions between sound and silence, and hence the end points of the utterance.
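The end-point search described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method of the cited TESPAR documents: it assumes that the sum of the ‘S’ matrix elements for a segment equals the number of epochs in that segment, and the names `segment_activity`, `find_end_points` and the threshold value are hypothetical.

```python
def epoch_count(segment):
    """Count epochs (runs of samples sharing one sign) in a segment.

    Assumption: samples that are exactly zero do not start an epoch,
    so an all-zero (silent) segment contains no epochs.
    """
    count = 0
    prev_sign = 0
    for x in segment:
        sign = (x > 0) - (x < 0)
        if sign != 0 and sign != prev_sign:
            count += 1          # the sign changed, so a new epoch begins
            prev_sign = sign
    return count


def segment_activity(signal, segment_len):
    """Return one activity value (epoch count) per fixed-length segment."""
    return [epoch_count(signal[i:i + segment_len])
            for i in range(0, len(signal), segment_len)]


def find_end_points(activity, threshold):
    """Return (first, last) indices of segments above threshold, or None."""
    active = [i for i, value in enumerate(activity) if value > threshold]
    if not active:
        return None
    return active[0], active[-1]


# Toy signal: silence, a burst of oscillation, silence.
signal = [0] * 8 + [1, -1, 1, -1, 1, -1, 1, -1] + [0] * 8
activity = segment_activity(signal, segment_len=4)
end_points = find_end_points(activity, threshold=1)
```

With real recordings the silent stretches contain noise rather than exact zeros, so the threshold would have to be chosen from the data.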
There are many ways in which the speech signal may be time encoded. An example of the time encoding procedure is now described. The first step is to divide the signal to be encoded into sections at the points where the signal crosses the zero line. These sections are referred to as epochs. Each epoch is then categorised according to its duration, the number of complex zeroes that occur within it and the maximum amplitude of the signal. The epochs are then assigned to particular groups, and the resulting distribution of epochs among the groups is used to characterise the encoded signal. In a simple case this could mean assigning each epoch to a group determined by its shape, duration and magnitude. The simple one-dimensional histogram of the number of epochs in each group is then used to characterise the signal.
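The example procedure above might be sketched as follows for a sampled signal. The grouping rule here is an assumption made for illustration (duration in samples, number of interior minima of the magnitude, and amplitude quantised in steps of `amp_step`); the actual symbol alphabets used in TESPAR coders are not given in this text, and all function names are hypothetical.

```python
from collections import Counter


def split_epochs(signal):
    """Split a sampled signal into epochs at its zero crossings.

    Assumption: samples that are exactly zero are skipped.
    """
    epochs, current = [], []
    for x in signal:
        if x == 0:
            continue
        if current and (x > 0) != (current[0] > 0):
            epochs.append(current)   # sign changed: close the epoch
            current = []
        current.append(x)
    if current:
        epochs.append(current)
    return epochs


def describe_epoch(epoch):
    """Return (duration, number of interior minima, maximum amplitude).

    Working on magnitudes treats positive minima and negative maxima
    alike. Flat-bottomed minima are not counted in this sketch.
    """
    mags = [abs(x) for x in epoch]
    minima = sum(1 for i in range(1, len(mags) - 1)
                 if mags[i] < mags[i - 1] and mags[i] < mags[i + 1])
    return len(epoch), minima, max(mags)


def encode(signal, amp_step=1.0):
    """Histogram of epochs per (duration, shape, magnitude) group."""
    histogram = Counter()
    for epoch in split_epochs(signal):
        duration, minima, amplitude = describe_epoch(epoch)
        group = (duration, minima, int(amplitude / amp_step))
        histogram[group] += 1
    return histogram


signal = [1, 2, 1, -1, -3, -1, -3, -1, 1, 2, 1]
histogram = encode(signal)
# histogram holds 2 epochs in group (3, 0, 2) and 1 epoch in group (5, 1, 3)
```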
The Hidden Markov Model (HMM) may take short segments of the input signal and Fourier transform them. The resulting spectrum may then be used to assign the time segment to a particular sub-phone. The sequence of these sounds may then be fed into the model and a probability output for each word considered. Thus a ranking of words is produced that specifies which word was most likely to have given rise to the observed speech waveform.
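The word-ranking step can be illustrated with the standard forward algorithm for a discrete HMM. This is a toy sketch, not a description of any commercial recogniser: it assumes that each time segment has already been assigned a sub-phone label (here the integers 0 and 1), and the two word models and all their probabilities are invented for the example.

```python
def forward_probability(obs, start, trans, emit):
    """P(obs | model) for a discrete HMM, by the forward algorithm.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of state s emitting label o
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return sum(alpha)


def rank_words(obs, models):
    """Return (word, probability) pairs, most likely word first."""
    scores = {word: forward_probability(obs, *model)
              for word, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical two-state, left-to-right models for two words.
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "yes": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),
    "no":  ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),
}
ranking = rank_words([0, 0, 1, 1], models)
# "yes" ranks first for this label sequence
```

For long label sequences the products become very small, so practical implementations usually work with log probabilities or rescale `alpha` at each step.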
One possible method of enhancing the recognition process is to use the time encoded signal to provide additional input parameters for the HMM. One such possibility is for the time encoded signal to be used to determine the identity of the speaker so that the HMM parameters may be modified accordingly.
Both the HMM and the TESPAR system produce probabilities for matches between the input speech and the speech elements in the system's vocabulary. TESPAR is, in addition, well suited to distinguishing between a predetermined selection of sounds. Thus, if one model narrows the number of likely words corresponding to the input speech down to a number of possibilities, the other model will probably be able to select which is the most likely from the shortlist. In this way the overall accuracy of the speech recognition system can be enhanced by including information from the time domain, in the form of TESPAR encoding, as well as information from the frequency domain.
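One possible way to combine the two sets of scores is sketched below: the HMM scores produce a shortlist and the TESPAR scores pick the winner from it. The text does not specify a combination rule at this point, so the function name, the shortlist size and the scores are all assumptions made for illustration.

```python
def pick_speech_element(hmm_scores, tespar_scores, shortlist_size=3):
    """Shortlist by HMM score, then choose by TESPAR score.

    Both arguments map speech elements to scores (higher is better).
    Returns None when there are no candidates at all.
    """
    if not hmm_scores:
        return None
    shortlist = sorted(hmm_scores, key=hmm_scores.get,
                       reverse=True)[:shortlist_size]
    # Elements missing from tespar_scores are treated as scoring 0.0.
    return max(shortlist, key=lambda element: tespar_scores.get(element, 0.0))


# Invented scores: the HMM cannot separate "bat" from "pat" clearly,
# but the TESPAR score can.
hmm_scores = {"bat": 0.40, "pat": 0.38, "cat": 0.15, "dog": 0.07}
tespar_scores = {"bat": 0.2, "pat": 0.7, "cat": 0.1, "dog": 0.9}
chosen = pick_speech_element(hmm_scores, tespar_scores, shortlist_size=2)
# chosen == "pat"; "dog" is never considered because the HMM ruled it out
```

The roles could equally be reversed, with TESPAR producing the shortlist and the HMM making the final choice.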
Various methods exist for deriving scores for different speech elements using the TESPAR method. For example, correlation scores can be found between the matrix generated from the input signal and the archetype matrix for each speech element. More commonly a neural net can be trained, using known examples, to differentiate between different speech elements.
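The correlation approach might look like the sketch below. The text does not say which correlation measure is used, so the Pearson correlation over the flattened matrices is an assumption, and the archetype matrices and function names are invented for the example.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equally sized matrices.

    Matrices are lists of rows. Returns 0.0 if either matrix is constant.
    """
    x = [value for row in a for value in row]
    y = [value for row in b for value in row]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    norm_x = math.sqrt(sum((p - mean_x) ** 2 for p in x))
    norm_y = math.sqrt(sum((q - mean_y) ** 2 for q in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def best_archetype(matrix, archetypes):
    """Return the name of the archetype that correlates best with matrix."""
    return max(archetypes,
               key=lambda name: correlation(matrix, archetypes[name]))


# Hypothetical 2x2 archetype matrices for two speech elements.
archetypes = {
    "one": [[4, 0], [0, 1]],
    "two": [[0, 3], [2, 0]],
}
match = best_archetype([[5, 1], [0, 1]], archetypes)
# match == "one"
```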
The time encoding may include the steps of
identifying the intervals between the occurrences of the input parameter crossing a given value, and quantising the lengths of the intervals,
identifying the number of complex zeroes of the input parameter, up to a predetermined rank, in the said intervals, and
recording the quantised lengths of the intervals and a measure of the said number of complex zeroes up to a predetermined rank as a representation of the variation of the input parameter.
A predetermined rank of 1 has been found to give good results. In this case the method records the number of first rank zeroes, i.e. positive minima or negative maxima. This information may provide sufficient detail for useful characterisation without requiring excessive calculation.
The method thus parameterises the shape of the input parameter function. If the parameter rises smoothly to a maximum and then falls smoothly to the next zero, there will be no positive minima so said number will be zero.
If the function has an “M” shape, rising to a maximum, falling to a minimum and then rising to another maximum before passing through zero, then there will be one positive minimum so the said number will be one.
Thus, the number parameterises the number of oscillations of the input parameter between zeroes, i.e. in each epoch.
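The two shapes just described can be checked with a small sketch for a sampled epoch. It assumes a predetermined rank of 1 and that the epoch is given as a list of samples sharing one sign; the function name is hypothetical, and flat-bottomed minima are not counted.

```python
def count_first_rank_zeroes(epoch):
    """Count positive minima (or negative maxima) inside one epoch.

    Working on magnitudes lets the same test serve epochs of either sign.
    """
    mags = [abs(x) for x in epoch]
    return sum(1 for i in range(1, len(mags) - 1)
               if mags[i] < mags[i - 1] and mags[i] < mags[i + 1])


smooth = [1, 3, 5, 3, 1]          # rises to one maximum, then falls
m_shape = [1, 4, 2, 4, 1]         # two maxima with a positive minimum between
negative = [-1, -4, -2, -4, -1]   # the mirror image: one negative maximum

counts = [count_first_rank_zeroes(e) for e in (smooth, m_shape, negative)]
# counts == [0, 1, 1]
```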
Positive minima and negative maxima are known as complex zeroes of a function because they correspond to zeroes of the function for complex-valued inputs. The first rank zeroes occur at real values that are the real parts of the complex numbers at which the function has the value zero.
The coding method may be a TESPAR method.
The method may further comprise the step of generating a code number taking one
Azima Henry
Ferekidis Charalampos
Kavanagh Sean
Dorvil Richemond
Foley & Lardner
New Transducers Limited
Nolan Daniel A.