Reexamination Certificate
1998-06-15
2001-06-05
Tsang, Fan (Department: 2645)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Synthesis
C704S258000
active
06243680
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to a method and an apparatus for automatically performing desired actions in response to spoken requests. It is particularly applicable to a method and an apparatus for generating entries for a speech recognition dictionary, as may be used to automate partially or fully the training of a speech recognition dictionary in a speech recognition system. The method and apparatus may be used to train a speech recognition dictionary for a telephone directory assistance system, voice activated dialing (VAD), credit card number identification and other speech recognition enabled services.
BACKGROUND OF THE INVENTION
In addition to providing printed telephone directories, telephone companies provide information services to their subscribers. The services may include stock quotes, directory assistance and many others. In most of these applications, when the information requested can be expressed as a number or number sequence, the user is required to enter his request via a touch tone telephone. This is often aggravating for the user since he is usually obliged to make repetitive entries in order to obtain a single answer. This situation becomes even more difficult when the input information is a word or phrase. In these situations, the involvement of a human operator may be required to complete the desired task.
Because telephone companies are likely to handle a very large number of calls per year, the associated labour costs are very significant. Consequently, telephone companies and telephone equipment manufacturers have devoted considerable efforts to the development of systems that reduce the labour costs associated with providing information services on the telephone network. These efforts comprise the development of sophisticated speech processing and recognition systems that can be used in the context of telephone networks.
In typical speech recognition systems, the user enters his request using isolated word, connected word or continuous speech via a microphone or telephone set. The request may be a name, a city or any other type of information for which either a function is to be performed or information is to be supplied. If valid speech is detected, the speech recognition layer of the system is invoked in an attempt to recognize the unknown utterance. Entries in a speech recognition dictionary, typically comprising transcriptions associated with labels, are scored in order to determine the most likely match to the utterance.
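The dictionary scoring step described above can be sketched as follows. This is only an illustration, not the scoring method of any particular recognizer: the names `edit_distance` and `recognize` are hypothetical, and a negated phoneme edit distance stands in for the acoustic score that a real system would compute from the audio signal.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Transcription = Tuple[str, ...]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def recognize(observed: Sequence[str],
              dictionary: Dict[str, List[Transcription]]
              ) -> Tuple[Optional[str], float]:
    """Score every transcription of every entry and return the best label.

    ASSUMPTION: the score is a negated edit distance, used here as a toy
    stand-in for an acoustic likelihood.
    """
    best_label: Optional[str] = None
    best_score = float("-inf")
    for label, transcriptions in dictionary.items():
        for transcription in transcriptions:
            score = -edit_distance(observed, transcription)
            if score > best_score:
                best_label, best_score = label, score
    return best_label, best_score


# Hypothetical two-entry dictionary: label -> list of transcriptions.
dictionary = {
    "Smith": [("s", "m", "ih", "th")],
    "Jones": [("jh", "ow", "n", "z")],
}
label, score = recognize(("s", "m", "ih", "t"), dictionary)
print(label, score)  # Smith -1
```

Note that the inner loop visits every transcription stored for every label, so the cost of recognition grows with the number of transcriptions kept per word.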
In speech applications, it is desirable to create the speech recognition dictionary or to add entries to the speech recognition dictionary by simply providing sample utterances of a new word along with a textual representation of that word. For example, it may be required to add a new name and associated telephone number to a voice activated dialing system. In another example, it may be desirable to add a new function in a robot control system instructing the robot to perform some new task. In order to achieve this, utterances of the new name or new function to be added are gathered (typically 2 or 3 utterances). Based on these sample training utterances, a new entry, generally comprising a transcription of the utterances, is created in the speech recognition dictionary and used for recognition purposes at a later time.
Traditionally, to get an accurate transcription of a spoken word, expert phoneticians listen to the words as they are spoken and transcribe them. This operation is time consuming and the labour costs associated with the expert phoneticians are significant. As a result, systems providing automatic transcriptions of spoken words have been developed.
A common approach in generating the transcription for a new word is to obtain a series of training utterances of the same word and decode each of the utterances separately using a continuous allophone recogniser device. This approach generates a series of separate alternative acoustic sub-word representations, each representation corresponding to a different pronunciation of the same word. All these transcriptions are then stored in a speech recognition dictionary. Mathematically, this operation can be expressed as follows:
$$T_i = \arg\max_{t \in T} P(t \mid Y_i), \qquad i = 1, \ldots, p \qquad \text{(Equation 1)}$$

where $T_i$ is the transcription of the ith utterance, p is the number of training utterances, $\{Y_1, Y_2, Y_3, \ldots, Y_p\}$ are the training utterances, T is the set of all possible transcriptions for any word and $P(\cdot \mid \cdot)$ designates a conditional probability computation. A problem with this approach is that the computational cost of the recognition stage is very high since, for each word, the speech recognition system must score multiple entries in the dictionary. For a more detailed explanation, the reader is invited to consult R. Haeb-Umbach et al. “Automatic Transcription Of Unknown Words In A Speech Recognition System”, Proc. of ICASSP'95, pp. 840-843, 1995 and N. Jain et al. “Creating Speaker-Specific Phonetic Templates With A Speaker-Independent Phonetic Recognizer: Implications For Voice Dialing”, Proc. ICASSP'96, pp. 881-884, 1996. The content of these documents is hereby incorporated by reference. A variant of this approach using a set of rules to automatically generate a set of likely transcriptions is described in “Automatic Rule-Based Generation of Word Pronunciation Networks” by Nick Cremelie and Jean-Pierre Martens, ESCA Eurospeech97, Rhodes, Greece, ISSN 1018-4074, pp. 2459-2462 whose contents are hereby incorporated by reference.
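As a rough illustration of Equation 1, the sketch below picks, for each training utterance separately, the candidate transcription with the highest conditional probability. The function name `transcribe_each` and the probability values are hypothetical; a real allophone recogniser would search the space of transcriptions rather than read values from a table.

```python
from typing import Dict, List


def transcribe_each(posteriors: List[Dict[str, float]]) -> List[str]:
    """For each utterance Y_i, return T_i = argmax over t of P(t | Y_i).

    posteriors[i] maps each candidate transcription t to P(t | Y_i).
    ASSUMPTION: the candidate set is small enough to enumerate in a table.
    """
    return [max(p_i, key=p_i.get) for p_i in posteriors]


# Two training utterances of the same word, with made-up probabilities.
posteriors = [
    {"n ao r t eh l": 0.6, "n ao r t ah l": 0.4},  # P(t | Y_1)
    {"n ao r t eh l": 0.3, "n ao r t ah l": 0.7},  # P(t | Y_2)
]
transcriptions = transcribe_each(posteriors)
print(transcriptions)  # ['n ao r t eh l', 'n ao r t ah l']
```

In this toy case the two utterances yield two different transcriptions, so the dictionary entry for the word would hold both, and both would have to be scored at recognition time.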
Another approach is to take the series of separate alternative transcriptions of the new word, generated as described in Equation 1, and then select a single transcription which best represents all the utterances. Essentially, a transcription $T_{best}$ is chosen which is the most likely to have produced all utterances $\{Y_1, Y_2, Y_3, \ldots, Y_p\}$. Mathematically, this operation can be expressed as follows:

$$T_{best} = \arg\max_{t \in \{T_1, T_2, \ldots, T_p\}} \left( \prod_{i=1}^{p} P(Y_i \mid t) \right) \qquad \text{(Equation 2)}$$
For a more detailed explanation, the reader is invited to consult R. Haeb-Umbach et al. “Automatic Transcription Of Unknown Words In A Speech Recognition System”, Proc. of ICASSP'95, pp. 840-843, 1995 whose content is incorporated by reference. Choosing a single transcription for the new word reduces the memory space required for the dictionary and reduces the amount of time necessary to score the dictionary. However, the selected transcription merely reflects the acoustic information in the utterance that originated the transcription, and disregards the acoustic information of the utterances associated with the transcriptions that were rejected during the selection process.
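The selection in Equation 2 can be sketched as below. The function name `select_best`, the candidate labels and the likelihood values are all hypothetical. The product of likelihoods is computed as a sum of logarithms, which gives the same argmax and is a common way to avoid numerical underflow; this is a choice made for the sketch, not something stated in the cited reference.

```python
import math
from typing import Dict, List


def select_best(candidates: List[str],
                likelihoods: Dict[str, List[float]]) -> str:
    """Return the candidate t maximising the product over i of P(Y_i | t).

    likelihoods[t][i] holds P(Y_i | t) for candidate t and utterance Y_i.
    ASSUMPTION: these values come from some acoustic scorer not shown here.
    """
    def joint_log_likelihood(t: str) -> float:
        # A zero likelihood makes the whole product zero (log of -infinity).
        return sum(math.log(p) if p > 0 else float("-inf")
                   for p in likelihoods[t])

    return max(candidates, key=joint_log_likelihood)


# Candidates T_1 and T_2 scored against three utterances (made-up values).
candidates = ["T_1", "T_2"]
likelihoods = {
    "T_1": [0.5, 0.1, 0.4],  # product = 0.020
    "T_2": [0.3, 0.3, 0.3],  # product = 0.027
}
best = select_best(candidates, likelihoods)
print(best)  # T_2
```

Here T_1 fits the first utterance best but fits the second poorly, so T_2 is selected. The result is still one of the original per-utterance transcriptions, which is the limitation the paragraph above points out.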
Thus, there exists a need in the industry to refine the process of adding a new word to a speech recognition dictionary such as to obtain a more accurate representation for new entries and to reduce the computational costs at the recognition stage.
OBJECTS AND STATEMENT OF THE INVENTION
An object of the invention is to provide a method and apparatus that can be used for creating a transcription capable of being used to generate an entry in a speech recognition dictionary.
Another object of the invention is a computer readable storage medium containing a program element that instructs a computer to generate a transcription capable of being used to generate an entry in a speech recognition dictionary.
Another object of the invention is to provide an apparatus for creating an entry for a certain word in a speech recognition dictionary.
As embodied and broadly described herein the invention provides an apparatus for creating a transcription capable of being used to generate an entry in a speech recognition dictionary for a certain word, said apparatus comprising:
a first input for receiving an audio signal derived from an utterance of the certain word;
a second input for receiving data representative of an orthographic representation of
Gupta Vishwa Nath
Sabourin Michael
Nortel Networks Limited
Opsasnick Michael N.
Tsang Fan