Method and apparatus for generating phrasal transcriptions
Reexamination Certificate
1999-09-24
2002-06-18
Dorvil, Richemond (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S235000, C704S239000
active
06408271
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to the field of speech recognition and speech synthesis. This invention is particularly applicable to the generation of speech recognition dictionaries including phrasal transcriptions for use in speech recognition systems as may be used in a telephone directory assistance system, voice activated dialing (VAD) system, personal voice dialing system and other speech recognition enabled services. This invention is also applicable to text-to-speech synthesizers for generating suitable pronunciations of phrases.
BACKGROUND OF THE INVENTION
Speech recognition enabled services are increasingly popular. These services include stock quotes, directory assistance, reservations and many others.
In typical speech recognition systems, the user enters his request using isolated word, connected word or continuous speech via a microphone or telephone set. If valid speech is detected, the speech recognition layer of the system is invoked in an attempt to recognize the unknown utterance. Typically, entries in a speech recognition dictionary, usually including transcriptions associated to labels, are scored in order to determine the most likely match to the utterance. The recognition of speech involves aligning an input audio signal with the most appropriate target speech model. The target speech model for a particular vocabulary item is built by concatenating the speech models of the transcription or transcriptions associated to that particular vocabulary item.
Of particular interest here are speech recognizers capable of recognizing complete phrases. Speech recognition dictionaries used in such speech recognition systems often comprise transcriptions for complete phrases, herein designated as phrasal transcriptions. A phrasal transcription is a representation of the pronunciation of the associated complete phrase when uttered by a human. Each phrasal transcription is associated to a label indicative of the orthographic representation of the phrase, herein designated as the orthographic phrase. Typically, multiple phrasal transcriptions are provided for each orthographic phrase thereby allowing for different pronunciations of the phrase. A limit on the total number of phrasal transcriptions in a speech recognition dictionary is imposed due to the inherent computational limits of the speech recognizer as well as due to the memory requirements for storing the phrasal transcriptions. Typically, the limit on the total number of phrasal transcriptions is put into practice by limiting the maximum number of phrasal transcriptions stored for each phrase.
A number of methods have been explored for generating a set of phrasal transcriptions to be included in a speech recognition dictionary. Common methods use outer-product procedures to generate the set of phrasal transcriptions. In a typical case, a group of word transcriptions is generated for each vocabulary item in the orthographic phrase. Permutations of the word transcriptions are then used to generate the phrasal transcriptions. A commonly used permuting rule, herein referred to as the F(i) permuting rule, can be mathematically defined as follows:
$$
F(i) = \begin{cases}
1 + \prod_{x=1}^{i-1} N_x & \text{for } i > 1 \\
1 & \text{for } i = 1
\end{cases}
$$
where $N_i$ is the number of word transcriptions in the group of word transcriptions associated with the $i$th vocabulary item of the orthographic phrase. This permuting rule permutes the $i$th vocabulary item every F(i) phrasal transcriptions. A specific example will better illustrate this permuting rule. Consider the orthographic phrase “Mary's little lamb”, which comprises three vocabulary items, namely “Mary's”, “little” and “lamb”. The vocabulary items are transcribed using a standard word transcription tool, which yields a group of word transcriptions for each vocabulary item.
Mary's (i=1) -->/mEriz/, /mAriz/, /m*riz/
little (i=2) -->/lIt*l/, /lId*l/, /lIt*/, /lId*/
lamb (i=3) -->/lamb/, /lam/
Each word transcription has a word transcription probability associated to it. In this specific example, the word transcription probabilities are as follows:
p(/mEriz/|“Mary's”)=0.7
p(/mAriz/|“Mary's”)=0.2
p(/m*riz/|“Mary's”)=0.1
p(/lIt*l/|“little”)=0.46
p(/lId*l/|“little”)=0.44
p(/lIt*/|“little”)=0.06
p(/lId*/|“little”)=0.04
p(/lamb/|“lamb”)=0.6
p(/lam/|“lamb”)=0.4
The word transcription probabilities are used to order and truncate the list of word transcriptions. Typically, the word transcriptions are sorted by likelihood, meaning that the first word transcription has the highest transcription probability. Assuming a word transcription limit of 2 word transcriptions per vocabulary item, the two word transcriptions having the highest probabilities are kept and the remaining word transcriptions are discarded. In this specific example, this results in the following word transcription groups for the vocabulary items in the orthographic phrase:
Mary's -->/mEriz/, /mAriz/
little -->/lIt*l/, /lId*l/
lamb -->/lamb/, /lam/
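The ordering and truncation step just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `WORD_GROUPS` and `truncate_group` are hypothetical, and the probabilities are the ones given in the example.

```python
# Hypothetical container for the example's word transcription groups,
# mapping each vocabulary item to {word transcription: probability}.
WORD_GROUPS = {
    "Mary's": {"mEriz": 0.7, "mAriz": 0.2, "m*riz": 0.1},
    "little": {"lIt*l": 0.46, "lId*l": 0.44, "lIt*": 0.06, "lId*": 0.04},
    "lamb": {"lamb": 0.6, "lam": 0.4},
}


def truncate_group(probs: dict, limit: int) -> list:
    """Sort word transcriptions by descending probability and keep the first `limit`."""
    ranked = sorted(probs, key=lambda t: probs[t], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    # Word transcription limit of 2 per vocabulary item, as in the example.
    for word, probs in WORD_GROUPS.items():
        print(word, "-->", truncate_group(probs, limit=2))
```

Running it prints the same three truncated groups listed above.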
In the above word transcription groups, the 3rd word transcription for “Mary's” and the 3rd and 4th word transcriptions for “little” have been deleted from the original list. The word transcriptions are then permuted according to the F(i) permuting rule and concatenated, leading to the following phrasal transcriptions:
mEriz lIt*l lamb
mAriz lIt*l lamb
mEriz lId*l lamb
mAriz lId*l lamb
mEriz lIt*l lam
mAriz lIt*l lam
mEriz lId*l lam
mAriz lId*l lam
For this specific example, the F(i) permuting rule generated eight permutations of the word transcriptions: the word transcription of the first vocabulary item varies from one phrasal transcription to the next, that of the second vocabulary item varies every second phrasal transcription, and that of the third vocabulary item varies every fourth phrasal transcription. Assuming a phrasal transcription limit of 4 transcriptions per phrase, we then have:
mEriz lIt*l lamb
mAriz lIt*l lamb
mEriz lId*l lamb
mAriz lId*l lamb
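The permutation order above can be reproduced with a short sketch. It is an illustration under assumptions, not the patent's code: `f_rule` evaluates F(i) exactly as written, while `permute_f_order` (a hypothetical helper) builds the outer product so that the $i$th vocabulary item changes every $N_1 \cdots N_{i-1}$ phrasal transcriptions, which matches the eight transcriptions listed in the example. In that example F(2) = 3 and F(3) = 5, which coincide with the positions (counting from 1) at which "little" and "lamb" first take their second word transcription.

```python
from math import prod
from typing import List, Optional


def f_rule(i: int, sizes: List[int]) -> int:
    """F(i) as given in the text; i is 1-based and sizes[x - 1] holds N_x."""
    if i == 1:
        return 1
    return 1 + prod(sizes[: i - 1])


def permute_f_order(groups: List[List[str]], limit: Optional[int] = None) -> List[str]:
    """Outer product of the word transcription groups, ordered so that the
    ith vocabulary item changes every N_1 * ... * N_(i-1) phrasal transcriptions."""
    total = prod(len(g) for g in groups)
    phrases = []
    for k in range(total):  # k: 0-based index of the phrasal transcription
        words = []
        period = 1  # run length over which the current item keeps one word transcription
        for group in groups:
            words.append(group[(k // period) % len(group)])
            period *= len(group)
        phrases.append(" ".join(words))
    # Apply an optional phrasal transcription limit by keeping the first entries.
    return phrases if limit is None else phrases[:limit]


if __name__ == "__main__":
    groups = [["mEriz", "mAriz"], ["lIt*l", "lId*l"], ["lamb", "lam"]]
    print([f_rule(i, [len(g) for g in groups]) for i in (1, 2, 3)])  # [1, 3, 5]
    for phrase in permute_f_order(groups, limit=4):
        print(phrase)
```

With `limit=4` the sketch returns the four phrasal transcriptions shown above; in all of them "lamb" keeps the word transcription /lamb/.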
A deficiency of the above-described method is that it emphasizes variations from left to right. More specifically, in the set of selected phrasal transcriptions, the word transcriptions of the vocabulary item in the first position of the phrase are permuted several times, while vocabulary items appearing later in the phrase are varied less frequently or not at all, as the above example illustrates. Consequently, variations in pronunciation for vocabulary items appearing later in a phrase are modeled less effectively than variations for vocabulary items appearing closer to the beginning of a phrase.
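The left-to-right bias can be made concrete by counting, for each word position in the four selected phrasal transcriptions of the example, how many distinct word transcriptions occur and how often the word transcription changes between consecutive entries. This is only an illustrative measurement; the helper names are hypothetical.

```python
# The four phrasal transcriptions kept by the F(i) rule in the example.
SELECTED = [
    "mEriz lIt*l lamb",
    "mAriz lIt*l lamb",
    "mEriz lId*l lamb",
    "mAriz lId*l lamb",
]


def variants_per_position(phrases: list) -> list:
    """Count the distinct word transcriptions observed at each word position."""
    columns = zip(*(p.split() for p in phrases))
    return [len(set(col)) for col in columns]


def changes_per_position(phrases: list) -> list:
    """Count how often each position differs from the previous phrasal transcription."""
    rows = [p.split() for p in phrases]
    return [
        sum(prev[j] != cur[j] for prev, cur in zip(rows, rows[1:]))
        for j in range(len(rows[0]))
    ]


if __name__ == "__main__":
    print(variants_per_position(SELECTED))  # [2, 2, 1]
    print(changes_per_position(SELECTED))   # [3, 1, 0]
```

The first position changes three times, the second once, and the third (“lamb”) never varies.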
Another deficiency of the above noted method is that it does not reflect any probability information associated to the word transcriptions other than to truncate the groups of word transcriptions. Additionally, the above-described method does not provide any mechanism for including language probability information in the selection of the set of phrasal transcriptions.
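To see what ignoring the probabilities can cost, the sketch below scores each phrasal transcription as the product of its word transcription probabilities. That scoring is an assumption made here for illustration (it treats the words as independent) and is not stated in the text. Under it, the four transcriptions kept by the F(i) rule carry a combined score of 0.486, whereas the four highest-scoring combinations carry 0.63. All names are hypothetical.

```python
from itertools import product
from math import prod

# Truncated word transcription groups from the example, with their probabilities.
GROUPS = [
    {"mEriz": 0.7, "mAriz": 0.2},
    {"lIt*l": 0.46, "lId*l": 0.44},
    {"lamb": 0.6, "lam": 0.4},
]

# The first four phrasal transcriptions produced by the F(i) rule.
F_RULE_TOP4 = ["mEriz lIt*l lamb", "mAriz lIt*l lamb", "mEriz lId*l lamb", "mAriz lId*l lamb"]


def phrase_score(words, groups) -> float:
    """Assumed score: product of word transcription probabilities (independence)."""
    return prod(g[w] for g, w in zip(groups, words))


def top_by_score(groups, limit: int) -> list:
    """Rank every combination of word transcriptions by the assumed score."""
    candidates = product(*(list(g) for g in groups))
    ranked = sorted(candidates, key=lambda ws: phrase_score(ws, groups), reverse=True)
    return [" ".join(ws) for ws in ranked[:limit]]


def total_score(phrases, groups) -> float:
    """Sum of the assumed scores over a set of phrasal transcriptions."""
    return sum(phrase_score(p.split(), groups) for p in phrases)


if __name__ == "__main__":
    print(round(total_score(F_RULE_TOP4, GROUPS), 4))                 # 0.486
    print(top_by_score(GROUPS, 4))
    print(round(total_score(top_by_score(GROUPS, 4), GROUPS), 4))     # 0.63
```

Note that this probability-only ranking keeps /mEriz/ fixed in all four entries, so it does not by itself remove the positional imbalance described above.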
Thus, there exists a need in the industry to refine the process of selecting a set of transcriptions so as to obtain an improved set of phrasal transcriptions capable of being used by a speech recognition dictionary or by a text-to-speech synthesizer.
SUMMARY OF THE INVENTION
The present invention is directed to the generation of phrasal transcriptions.
In accordance with a broad aspect, the invention provides a method for generating a set of phrasal transcriptions suitable for use in a speech recognition dictionary. The method comprises providing an orthographic phrase comprising a set of vocabulary items. The method further comprises generating a group of word transcriptions for each vocabulary item in the orthographic phrase, each word transcription in the group of word transcriptions for a given vocabulary item being associated to an ordering data element. The ordering data elements esta
Sabourin Michael G.
Smith Kenneth W.
Dorvil Richemond
Nolan Daniel A
Nortel Networks Limited