Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition
Reexamination Certificate
1995-11-21
2001-05-08
Tsang, Fan (Department: 2645)
C704S231000
active
06230128
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to speech processing and in particular to a system for processing alternative parses of connected speech.
2. Related Art
Speech processing includes speaker recognition, in which the identity of a speaker is detected or verified, and speech recognition. Speech recognition in turn includes so-called speaker independent recognition, in which a system may be used by anyone without requiring recogniser training, and so-called speaker dependent recognition, in which the users allowed to operate a system are restricted and a training phase is necessary to derive information from each allowed user. It is common in recognition processing to input speech data, typically in digital form, to a so-called front-end processor, which derives from the stream of input speech data a more compact, perceptually significant set of data referred to as a front-end feature set or vector. For example, speech is typically input via a microphone, sampled (e.g. at 8 kHz), digitised, segmented into frames of length 10-20 ms and, for each frame, a set of coefficients is calculated. In speech recognition, the speaker is normally assumed to be speaking one of a known set of words or phrases. A stored representation of the word or phrase, known as a template or model, comprises a reference feature matrix of that word as previously derived from, in the case of speaker independent recognition, multiple speakers. The input feature vector is matched with the model and a measure of similarity between the two is produced.
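The front-end step described above can be sketched as follows. This is only an illustration: the function names, the non-overlapping framing and the choice of log energy as the single per-frame coefficient are assumptions made for the example, and a practical front end would compute a richer coefficient set.

```python
import math

def frame_signal(samples, sample_rate_hz=8000, frame_ms=20):
    """Split samples into consecutive, non-overlapping frames (an assumption)."""
    frame_len = sample_rate_hz * frame_ms // 1000  # 160 samples at 8 kHz, 20 ms
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def log_energy(frame):
    """One illustrative per-frame coefficient: log of the mean squared amplitude."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return math.log(mean_sq + 1e-12)  # small offset avoids log(0) on silence

# One second of a 440 Hz tone sampled at 8 kHz stands in for microphone input.
signal = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
frames = frame_signal(signal)
features = [log_energy(f) for f in frames]
print(len(frames), len(frames[0]))
```

Each entry of `features` plays the role of a (one-element) front-end feature vector for its frame.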
Speech recognition (whether human or machine) is susceptible to error and may result in the misrecognition of words. If a word or phrase is incorrectly recognised, the speech recogniser may then offer another attempt at recognition, which may or may not be correct.
Various ways have been suggested for processing speech to select the best or alternative matches between input speech and stored speech templates or models. In isolated word recognition systems, the production of alternative matches is fairly straightforward: each word is a separate ‘path’ in a transition network representing the words to be recognised and the independent word paths join only at the final point in the network. Ordering all the paths exiting the network in terms of their similarity to the stored templates or the like will give the best and alternative matches.
In most connected recognition systems, and in some isolated word recognition systems based on connected recognition techniques, however, it is not always possible to recombine all the paths at the final point of the network, and thus neither the best nor alternative matches are directly obtainable from the information available at the exit point of the network. One solution to the problem of producing a best match is discussed in “Token Passing: a Simple Conceptual Model for Connected Speech Recognition Systems” by S. J. Young, N. H. Russell and J. H. S. Thornton, 1989, which relates to passing packets of information, known as tokens, through a transition network. A token contains information relating to the partial path travelled as well as an accumulated score indicative of the degree of similarity between the input and the portion of the network processed thus far.
As described by Young et al, at each input of a frame of speech to a transition network, any tokens that are present at the input of a node are passed into the node and the current frame of speech is matched within the word models associated with those nodes. New tokens then appear at the output of the nodes (having “travelled” through the model associated with the node). Only the best scoring token is then passed on to the inputs of the following nodes. When the end of speech has been signalled (by an external device such as a pause detector), a single token will be present at the final node. From this token the entire path through the network can be extracted by tracing back along the path by means of the previous path information contained within the token to provide the best match to the input speech.
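The token passing scheme just described can be sketched in a few lines. The sketch makes several simplifying assumptions that are not taken from Young et al: each word model is reduced to a single state with a self-loop and a hypothetical scalar target value, the local score is the negative absolute difference between the frame and that target, and each token carries its history directly as a tuple of node names. The names `Token`, `token_passing`, `targets` and `successors` are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    score: float    # accumulated similarity (higher is better)
    history: tuple  # names of the nodes visited so far

def token_passing(frames, targets, successors, start, final):
    """Return the best token held by `final` once all frames are consumed.

    targets    -- hypothetical one-state word models: node name -> expected value
    successors -- node name -> list of node names that follow it
    """
    entering = {start: Token(0.0, ())}  # tokens waiting at node inputs
    inside = {}                         # best token held within each node's model
    for x in frames:
        new_inside = {}
        for node, target in targets.items():
            candidates = []
            if node in inside:          # token stays within the same model
                candidates.append(inside[node])
            if node in entering:        # token passed in from a preceding node
                tok = entering[node]
                candidates.append(Token(tok.score, tok.history + (node,)))
            if candidates:
                best = max(candidates, key=lambda t: t.score)
                new_inside[node] = Token(best.score - abs(x - target), best.history)
        inside = new_inside
        # Only the best scoring token reaches the input of each following node.
        entering = {}
        for node, tok in inside.items():
            for nxt in successors.get(node, []):
                if nxt not in entering or tok.score > entering[nxt].score:
                    entering[nxt] = tok
    return inside.get(final)

targets = {"start_noise": 0.0, "one": 1.0, "two": 2.0, "end_noise": 0.0}
successors = {"start_noise": ["one", "two"], "one": ["end_noise"], "two": ["end_noise"]}
best = token_passing([0.1, 0.9, 1.1, 0.0], targets, successors, "start_noise", "end_noise")
print(best.history)
```

Because every node input keeps only its best token, a single history survives at the final node, and the runner-up hypothesis (here the path through "two") is no longer available there.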
The article “A unified direction mechanism for automatic speech recognition using Hidden Markov Models” by S. C. Austin and F. Fallside, ICASSP 1989, Vol. 1, pages 667-670, relates to a connected word speech recogniser which operates in a manner similar to that described by Young et al, as described above. A history relating to the progress of the recognition through the transition network is updated on exiting the word model. At the end of recognition, the result of recognition is derived from the history presented to the output which has the best score. Again only one history is possible for each path terminating at the final node.
Such known arrangements do not allow for an alternative choice to be readily available at the output of the network.
SUMMARY OF THE INVENTION
In accordance with the invention a path link passing speech recognition system for recognising input connected speech comprises means for deriving recognition feature data from an input speech signal, processing means for modelling expected input speech and for comparing the recognition feature data with the modelled expected input speech, the processing means having a plurality of vocabulary nodes associated with word representation models, and means for indicating recognition of the input speech signal in dependence upon the comparison, characterised in that at least one of the vocabulary nodes can process more than one path link simultaneously.
Such an arrangement means that more than one incoming path link can be processed by a node at a given time and hence that more than one recognition result may be obtained.
The modelling means preferably comprises a transition network containing a plurality of noise nodes and vocabulary nodes which are associated with word representation models. The nodes are capable of producing path links comprising fields for storing a pointer to the previous path link, an accumulated score for a path, a pointer to a previous node and a time index for segmentation information. Preferably, the vocabulary nodes capable of processing more than one path link have more than one identical associated word representation model.
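As a rough illustration, a path link with the four fields listed above might be represented as below. The class name, field names and the use of a node name in place of a node pointer are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PathLink:
    previous_link: Optional["PathLink"]  # pointer to the previous path link
    score: float                         # accumulated score for the path
    previous_node: Optional[str]         # previous node, identified here by name
    time_index: int                      # frame index, for segmentation information

# A two-link chain: the network entry, then a link produced on leaving node "one".
start = PathLink(previous_link=None, score=0.0, previous_node=None, time_index=0)
after_word = PathLink(previous_link=start, score=-1.5, previous_node="one", time_index=12)
print(after_word.previous_node, after_word.time_index)
```

Keeping a pointer to the previous link rather than a full copy of the history means that paths sharing a common beginning also share the links that record it.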
The provision that at least one of the vocabulary nodes of the network has more than one associated word representation model allows the speech recogniser to process multiple paths at the same time and so allows more than one path link to be propagated across each inter-node link at each input frame. In effect, the invention creates multiple layers of a transition network along which several alternative paths may be propagated. The best scoring path may be processed by the first model of a node, the next best by the second and so on until either parallel models or incoming paths run out.
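The allocation of incoming paths to a node's parallel models can be sketched as follows. The function name is hypothetical, and the sketch assumes that a higher score means a better match.

```python
def assign_to_parallel_models(incoming_scores, n_models):
    """Pair the best incoming path links with a node's parallel models.

    Returns (model_index, score) pairs; zip stops as soon as either the
    parallel models or the incoming paths run out.
    """
    ranked = sorted(incoming_scores, reverse=True)  # assumes higher is better
    return list(zip(range(n_models), ranked))

# Three incoming paths but only two parallel models: the worst path is dropped.
assignments = assign_to_parallel_models([-4.2, -1.5, -3.0], n_models=2)
print(assignments)
```

With more models than incoming paths, the surplus models are simply left unused for that frame.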
In general terms “network” includes directed acyclic graphs (DAGs) and trees. A DAG is a network with no cycles and a tree is a network in which the only meeting of paths occurs conceptually right at the end of the network.
The term “word” here denotes a basic recognition unit, which may be a word but equally well may be a diphone, phoneme, allophone, etc. Recognition is the process of matching an unknown utterance with a predefined transition network, the network having been designed to be compatible with what a user is likely to say.
In order to identify the phrase that has been recognised, the system may include means for tracing the path link back through the network.
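Tracing back can be sketched as a walk along the chain of previous-link pointers. The attribute names `previous_link` and `previous_node` and the function name are assumptions for the example; any object carrying those two attributes would do.

```python
from types import SimpleNamespace

def trace_back(link):
    """Follow previous-link pointers to the start and return node names in spoken order."""
    nodes = []
    while link is not None:
        if link.previous_node is not None:
            nodes.append(link.previous_node)
        link = link.previous_link
    nodes.reverse()  # links were collected from the end of the path backwards
    return nodes

first = SimpleNamespace(previous_link=None, previous_node="noise")
second = SimpleNamespace(previous_link=first, previous_node="one")
third = SimpleNamespace(previous_link=second, previous_node="two")
last = SimpleNamespace(previous_link=third, previous_node="noise")
print(trace_back(last))
```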
Alternatively, the system may also include means for assigning a signature to at least some of the nodes having associated word representation models and means for comparing the signature of each path, to determine the path with the best match to the input speech and that with the second best alternative match.
This arrangement allows for an alternative which is necessarily different in character to the best match and does not differ merely in segmentation or noise matches.
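One way to picture the signature comparison is sketched below. How signatures are formed and combined is not specified in this passage, so the sketch assumes that a path's signature is the tuple of signatures of the vocabulary nodes it passed through (noise nodes contribute nothing), and that a higher score is better. The function name is hypothetical.

```python
def best_and_alternative(paths):
    """paths: list of (score, signature) pairs. Returns (best, alternative).

    The alternative is the best scoring path whose signature differs from
    that of the best path, or None if no such path exists.
    """
    ranked = sorted(paths, key=lambda p: p[0], reverse=True)
    if not ranked:
        return None, None
    best = ranked[0]
    for candidate in ranked[1:]:
        if candidate[1] != best[1]:
            return best, candidate
    return best, None

# The second path differs from the first only in segmentation, so it is skipped.
paths = [(-0.3, ("one",)), (-0.4, ("one",)), (-1.9, ("two",))]
best_path, alternative = best_and_alternative(paths)
print(best_path, alternative)
```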
The word representation models may be Hidden Markov Models (HMMs) as described generally in British Telecom Technology Journal, April 1988, Vol. 6, no. 2, page 105:
British Telecommunications public limited company
Nixon & Vanderhye P.C.
Opsasnick Michael N.
Tsang Fan