Speech synthesis apparatus and selection method

Classification: Data processing: speech signal processing – linguistics – language – Speech signal processing – Synthesis



Details

Classifications: C704S260000, C704S270100

Type: Reexamination Certificate

Status: active

Patent number: 06725199


FIELD OF THE INVENTION
The present invention relates to a speech synthesis apparatus and a method of selecting a synthesis engine for a particular speech application.
BACKGROUND OF THE INVENTION
FIG. 1 of the accompanying drawings is a block diagram of an exemplary prior-art speech system comprising an input channel 11 (including speech recognizer 5) for converting user speech into semantic input for dialog manager 7, and an output channel (including text-to-speech converter (TTS) 6) for receiving semantic output from the dialog manager for conversion to speech. The dialog manager 7 is responsible for managing a dialog exchange with a user in accordance with a speech application script, here represented by tagged script pages 15. This exemplary speech system is particularly suitable for use as a voice browser, the system being adapted to interpret mark-up tags, in pages 15, from, for example, four different voice markup languages (a dispatch sketch follows this list), namely:
dialog markup language tags that specify voice dialog behavior;
multimodal markup language tags that extend the dialog markup language to support other input modes (keyboard, mouse, etc.) and output modes (e.g. display);
speech grammar markup language tags that specify the grammar of user input; and
speech synthesis markup language tags that specify voice characteristics, types of sentences, word emphasis, etc.
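By way of illustration only, the following minimal Python sketch shows how a voice browser of this kind might dispatch tags from the four markup languages to separate handlers. The tag names, attributes, and handler behavior are invented for the example; they are not taken from the patent or from any particular markup standard.

# Hypothetical sketch: dispatch tags from the four voice markup languages
# to separate handlers. All tag names and attributes are invented.
import xml.etree.ElementTree as ET

HANDLERS = {
    "dialog":  lambda el: print("dialog behavior:", el.get("name")),   # dialog markup
    "input":   lambda el: print("extra input mode:", el.get("mode")),  # multimodal markup
    "grammar": lambda el: print("user-input grammar:", el.get("src")), # grammar markup
    "prosody": lambda el: print("synthesis hints:", el.attrib),        # synthesis markup
}

page = ET.fromstring(
    "<page>"
    "<dialog name='greeting'/>"
    "<input mode='keyboard'/>"
    "<grammar src='yes_no.gram'/>"
    "<prosody rate='slow' emphasis='strong'/>"
    "</page>"
)

for element in page:  # route each tag to the handler for its language
    HANDLERS.get(element.tag, lambda el: None)(element)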
When a page 15 is loaded into the speech system, dialog manager 7 determines from the dialog tags and multimodal tags what actions are to be taken (the dialog manager being programmed to understand both the dialog and multimodal languages 19). These actions may include auxiliary functions 18 (available at any time during page processing) accessible through application program interfaces (APIs) and including such things as database lookups, user identity and validation, telephone call control, etc. When speech output to the user is called for, the semantics of the output are passed, with any associated speech synthesis tags, to output channel 12, where a language generator 23 produces the final text to be rendered into speech by text-to-speech converter 6 and output (generally via a communications link) to speaker 17. In the simplest case, the text to be rendered into speech is fully specified in the voice page 15 and the language generator 23 is not required for generating the final output text; however, in more complex cases, only semantic elements are passed, embedded in tags of a natural language semantics markup language (not depicted in FIG. 1) that is understood by the language generator. The TTS converter 6 takes account of the speech synthesis tags when effecting text-to-speech conversion, for which purpose it is cognizant of the speech synthesis markup language 25.
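The output path just described can be summarized in a minimal Python sketch. All class and function names below are assumptions made for illustration, not the patent's API; the generator composes final text from semantics, and the converter stands in for a real TTS engine.

# Minimal sketch of the output channel: a language generator turns semantic
# output into final text, and a TTS converter renders it while honoring any
# speech synthesis tags. Names are illustrative assumptions.
class LanguageGenerator:
    def generate(self, semantics: dict) -> str:
        # Simplest case: the voice page fully specifies the text.
        if "text" in semantics:
            return semantics["text"]
        # More complex case: compose text from semantic elements.
        return f"You have {semantics['count']} new messages."

class TextToSpeechConverter:
    def render(self, text: str, synthesis_tags: dict) -> bytes:
        # A real converter would apply voice, emphasis, etc. from the tags.
        print(f"speaking with {synthesis_tags}: {text}")
        return text.encode("utf-8")  # stand-in for audio samples

def output_channel(semantics: dict, synthesis_tags: dict) -> bytes:
    text = LanguageGenerator().generate(semantics)
    return TextToSpeechConverter().render(text, synthesis_tags)

output_channel({"count": 3}, {"voice": "female", "rate": "slow"})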
User speech input is received by microphone 16 and supplied (generally via a communications link) to an input channel of the speech system. Speech recognizer 5 generates text which is fed to a language understanding module 21 to produce semantics of the input for passing to the dialog manager 7. The speech recognizer 5 and language understanding module 21 work according to a specific lexicon and grammar markup language 22 and, of course, take account of any grammar tags related to the current input that appear in page 15. The semantic output to the dialog manager 7 may simply be a permitted input word, or may be more complex and include embedded tags of a natural language semantics markup language. The dialog manager 7 determines what action to take next (including, for example, fetching another page) based on the received user input and the dialog tags in the current page 15.
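A corresponding sketch of the input path, again with invented function names, grammar representation, and intents, might look as follows:

# Sketch of the input path: recognizer text feeds a language understanding
# module, whose semantic result drives the dialog manager's next action.
def recognize(audio: bytes) -> str:
    return "check my balance"  # stand-in for a real speech recognizer

def understand(text: str, permitted_words: set) -> dict:
    # Honor the grammar tags of the current page: only permitted
    # words yield a usable semantic result.
    if any(word in permitted_words for word in text.split()):
        return {"intent": "balance_query", "text": text}
    return {"intent": "unknown", "text": text}

def dialog_manager(semantics: dict) -> str:
    # Choose the next action, e.g. fetch another page or re-prompt.
    if semantics["intent"] == "balance_query":
        return "fetch_balance_page"
    return "reprompt_user"

print(dialog_manager(understand(recognize(b""), {"balance", "transfer"})))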
Any multimodal tags in the voice page 15 are used to control and interpret multimodal input/output. Such input/output is enabled by an appropriate recognizer 27 in the input channel 11 and an appropriate output constructor 28 in the output channel 12.
A barge-in control functional block 29 determines when user speech input is permitted over system speech output. Allowing barge-in requires careful management and must minimize the risk of extraneous noises being misinterpreted as user barge-in, with a resultant inappropriate cessation of system output. A typical minimal barge-in arrangement in telephony applications is to permit the user to interrupt only upon pressing a specific dual-tone multi-frequency (DTMF) key, the control block 29 then recognizing the tone pattern and informing the dialog manager that it should stop talking and start listening. An alternative barge-in policy is to recognize user speech input only at certain points in a dialog, such as at the end of specific dialog sentences not themselves marking the end of the system's “turn” in the dialog. This can be achieved by having the dialog manager notify the barge-in control block of the occurrence of such points in the system output, the block 29 then checking to see whether the user starts to speak in the immediately following period. Rather than completely ignoring user speech during certain times, the barge-in control can be arranged to reduce the responsiveness of the input channel so that the risk of a barge-in being wrongly identified is minimized. If barge-in is permitted at any stage, it is preferable to require the recognizer to have ‘recognized’ a portion of user input before barge-in is determined to have occurred. However barge-in is identified, the dialog manager can be set to stop immediately, to continue to the end of the next phrase, or to continue to the end of the system's turn.
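Two of these barge-in policies can be sketched as follows; the policy names, the chosen DTMF key, and the window length are assumptions for illustration only.

# Sketch of two barge-in policies: interrupt only on a specific DTMF key,
# or accept speech only in a short window after dialog points notified by
# the dialog manager, and only once some input has been recognized.
import time
from typing import Optional

class BargeInControl:
    def __init__(self, policy: str, window_s: float = 1.5):
        self.policy = policy
        self.window_s = window_s
        self.window_opened_at: Optional[float] = None

    def dialog_point_reached(self) -> None:
        # Called by the dialog manager at the end of specific sentences.
        self.window_opened_at = time.monotonic()

    def should_interrupt(self, dtmf_key: Optional[str], recognized_words: int) -> bool:
        if self.policy == "dtmf":
            return dtmf_key == "#"  # interrupt only on one specific key
        if self.policy == "windowed":
            in_window = (self.window_opened_at is not None and
                         time.monotonic() - self.window_opened_at < self.window_s)
            # Require some recognized input so noise is not taken as barge-in.
            return in_window and recognized_words > 0
        return False

control = BargeInControl("windowed")
control.dialog_point_reached()
print(control.should_interrupt(dtmf_key=None, recognized_words=2))  # True

In either case, a positive decision would be reported to the dialog manager, which then stops immediately or continues to a convenient point as described above.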
Whatever its precise form, the speech system can be located at any point between the user and the speech application script server. It will be appreciated that whilst the FIG. 1 system is useful in illustrating typical elements of a speech system, it represents only one of a multitude of possible arrangements for such systems.
Because a speech system is fundamentally trying to do what humans do very well, most improvements in speech systems have come about as a result of insights into how humans handle speech input and output. Humans have become very adept at conveying information through the languages of speech and gesture. When listening to a conversation, humans continuously build and refine mental models of the concepts being conveyed. These models are derived not only from what is heard but also from how well the hearer thinks they have heard what was spoken. This distinction, between what and how well individuals have heard, is important. A measure of confidence in the ability to hear and to distinguish between concepts is critical to understanding and to the construction of meaningful dialogue.
In automatic speech recognition, there are clues to the effectiveness of the recognition process. The closer competing recognition hypotheses are to one another, the more likely confusion becomes. Likewise, the further the test data is from the trained models, the more likely errors are to arise. By extracting such observations during recognition, a separate classifier can be trained on correct hypotheses; such a system is described in the paper “Recognition Confidence Scoring for Use in Speech Understanding Systems”, T. J. Hazen, T. Burianek, J. Polifroni, and S. Seneff, Proc. ISCA Tutorial and Research Workshop ASR2000, Paris, France, September 2000.
FIG. 2 of the accompanying drawings depicts the system described in the paper and shows how, during the recognition of a test utterance, a speech recognizer 5 is arranged to generate a feature vector 31 that is passed to a separate classifier 32, where a confidence score (or simply an accept/reject decision) is generated. This score is then passed on to the natural language understanding component 21 of the system.
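A minimal sketch of such a confidence classifier, assuming two invented features (the score margin to the closest competing hypothesis and the distance of the test data from the trained models) and weights notionally obtained by offline training on correct hypotheses, is given below.

# Minimal sketch of the FIG. 2 arrangement: a feature vector from the
# recognizer is mapped to a confidence score by a separate classifier.
# The features and weights are invented for illustration.
import math

WEIGHTS = [2.0, -1.5]  # assumed to come from offline training
BIAS = -0.2

def confidence(feature_vector: list) -> float:
    # Logistic model: a larger margin over competing hypotheses raises
    # confidence; a larger distance from the trained models lowers it.
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, feature_vector))
    return 1.0 / (1.0 + math.exp(-z))

score = confidence([0.8, 0.3])  # [hypothesis margin, model distance]
print(f"confidence={score:.2f}, accept={score > 0.5}")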
So far as speech generation is concerned, the ultimate test of a speech output system is its overall quality (particularly intelligibility and naturalness) to a human. As a result, the traditional approach to assessing speech synthesis has been to perform subjective listening tests with human listeners.
