Reexamination Certificate
2000-03-21
2003-07-29
Abebe, Daniel (Department: 2741)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S275000
active
06601029
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to voice processing apparatus and the like, and more particularly to voice processing systems that use speech recognition.
2. Description of the Related Art
Voice processing systems whereby callers interact over a telephone network (e.g. the PSTN or the Internet) with computerised equipment are very well-known in the art, and include voice mail systems, voice response units, and so on. Typically such systems ask a caller questions using prompts formed from one or more prerecorded audio segments, and the caller inputs answers by pressing dual-tone multi-frequency (DTMF) keys on their telephone. This approach has proved effective for simple interactions, but is clearly restricted in scope by the limited number of keys available on a telephone. For example, alphabetical input is particularly difficult using DTMF keys.
There has therefore been an increasing tendency in recent years for voice processing systems to use speech recognition in order to augment DTMF input (N.B. the terms speech recognition and voice recognition are used interchangeably herein to denote the act of converting a spoken audio signal into text). The utilisation of speech recognition permits the handling of callers who do not have a DTMF phone, and also the acquisition of more complex information beyond simple numerals from the caller.
As an illustration of the above, WO96/25733 describes a voice response system which includes a prompt unit, a Voice Activity Detector (VAD), and a voice recognition unit. In this system, as a prompt is played to the caller, any input from the caller is passed to the VAD, together with the output from the prompt unit. This allows the VAD to perform echo cancellation on the incoming signal. Then, in response to the detection of voice by the VAD, the prompt is discontinued, and the caller input is switched to the recognition unit, thereby providing a barge-in facility.
Speech recognition in a telephony environment can be supported by a variety of hardware architectures. Many voice processing systems include a special DSP card for running speech recognition software. This card is connected to a line interface unit for the transfer of telephony data by a time division multiplex (TDM) bus. Most commercial voice processing systems, more particularly their line interface units and DSP cards, conform to one of two standard architectures: either the Signal Computing System Architecture (SCSA), or the Multi-vendor Integration Protocol (MVIP). A somewhat different configuration is described in GB 2280820, in which a voice processing system is connected via a local area network to a remote server, which provides a speech recognition facility. This approach is somewhat more complex than the TDM approach, given the data communication and management required, but does offer significantly increased flexibility.
Speech recognition systems are generally used in telephony environments as cost-effective substitutes for human agents, and are adequate for performing simple, routine tasks. It is important that such tasks be performed accurately, since otherwise there may be significant caller dissatisfaction. It is also important that they be performed as quickly as possible, both to improve caller throughput and because the owner of the voice processing system is often paying for the call, either via some freephone mechanism (e.g. an 0800 number) or because an outbound application is involved.
(Note that as used herein, the term “caller” simply indicates the party at the opposite end of a telephone connection to the voice processing system, and does not specify which party actually initiated the telephone connection.)
There has been an increase in recent years in the complexity of input permitted from the caller. This is supported firstly by the use of large vocabulary recognition systems, and secondly by natural language understanding and dialogue management. As a simple example of this, a pizza ordering application several years ago might have gone through a menu to determine the desired pizza size, topping, etc., with one prompt to elicit each property of the pizza from a caller. Now, however, such an application may simply ask: “What type of pizza would you like?” The caller response is passed to a large vocabulary speech recognition unit, with the recognised text then being processed in order to extract the relevant information describing the pizza.
The extraction of such information is typically performed by a natural language understanding (NLU) unit working in conjunction with a dialogue manager. These units have knowledge of grammar and syntax, which allows them to parse a caller response such as “I would like a large pizza with pepperoni” to extract the particular information desired by the application, namely that the desired pizza (a) is large, and (b) has a pepperoni topping.
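As a rough illustration of the extraction step just described, the following minimal sketch pulls a size and a topping out of recognised text by keyword matching. It is not the NLU unit of any actual system: the vocabularies SIZES and TOPPINGS and the function extract_pizza_slots are hypothetical names chosen for this example, and a real unit would rely on grammar and syntax rather than word lists.

```python
import re

# Hypothetical vocabularies for illustration only; a real NLU unit
# would use a grammar and parser rather than fixed word lists.
SIZES = {"small", "medium", "large"}
TOPPINGS = {"pepperoni", "mushroom", "ham", "olives"}


def extract_pizza_slots(text):
    """Return the size and toppings mentioned in a recognised utterance."""
    # Lower-case the text and split it into alphabetic words.
    words = re.findall(r"[a-z]+", text.lower())
    # First word that names a size, or None if no size was mentioned.
    size = next((w for w in words if w in SIZES), None)
    # Every word that names a known topping, in spoken order.
    toppings = [w for w in words if w in TOPPINGS]
    return {"size": size, "toppings": toppings}


slots = extract_pizza_slots("I would like a large pizza with pepperoni")
print(slots)
```

Keyword spotting of this kind ignores negation and word order (for example "no pepperoni"), which is one reason a parser with knowledge of grammar is used in practice.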
Such natural language processing and dialogue management is described in “Building Intelligent Dialog Systems” by S McRoy, S Ali, A Restificar, and S Channarukul, in Intelligence, Spring 99, p14-23, and (at a more practical level) in “An Object-Oriented Approach to Dialogue Management in Spoken Language Systems” by R Sparks, L Meiskey, and H Brunner, in Human Factors in Computing Systems, 1994, p211-217.
The above approach presents a much more natural interface for callers, provides greater flexibility, and can potentially reduce call handling time significantly. However, large vocabulary speech recognition is still not completely reliable in all cases. Therefore, when the statistical confidence in the recognition result is low, it is common to play back to the caller their selection for confirmation. Whilst this leads to a more robust system, the approach taken by current systems can seem rather robotic, and does not lead to the most efficient dialogue with the caller.
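The conventional confirmation behaviour described above can be sketched as a simple threshold test. This is an illustrative sketch only: the threshold value, the function names, and the wording of the confirmation prompt are assumptions made for this example and are not taken from any particular system.

```python
# Assumed threshold; real systems tune this value per application.
CONFIRM_THRESHOLD = 0.8


def needs_confirmation(confidence, threshold=CONFIRM_THRESHOLD):
    """Return True when the recognition confidence is below the threshold."""
    return confidence < threshold


def confirmation_prompt(recognised_text, confidence):
    """Return a playback prompt for low-confidence results, else None."""
    if needs_confirmation(confidence):
        # Hypothetical prompt wording: the selection is simply read back.
        return f"Did you say: {recognised_text}?"
    return None


print(confirmation_prompt("large pepperoni", 0.5))
print(confirmation_prompt("large pepperoni", 0.95))
```

Note that the prompt here is the same whichever part of the input was poorly recognised, which is the limitation the remainder of the text goes on to address.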
SUMMARY OF THE INVENTION
Accordingly, the invention provides a method of operating a voice processing system comprising the steps of:
receiving spoken input from a user;
performing speech recognition to convert said spoken input into text equivalent;
identifying at least two information elements in said text equivalent, each having an uncertainty associated therewith;
selecting a prompt according to which of said at least two information elements has the greatest uncertainty associated therewith; and
playing out said selected prompt to the user.
Voice applications involving speech recognition frequently play back user input for confirmation. Therefore, according to the invention, where the user has input two or more items, these are ranked in terms of uncertainty (typically based on the speech recognition confidence level associated with each information element). The playback prompt is then structured to account for which information element has the greatest uncertainty associated with it. This helps structure the playback to maximise the efficiency of the confirmation process.
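The ranking step can be sketched as follows. This is a minimal illustration under stated assumptions: the class InformationElement, its fields, and the definition of uncertainty as one minus the recognition confidence are hypothetical choices for this example; the text says only that uncertainty is typically based on the recognition confidence level.

```python
from dataclasses import dataclass


@dataclass
class InformationElement:
    """One item extracted from the recognised text (hypothetical type)."""
    name: str          # e.g. "size" or "topping"
    value: str         # e.g. "large" or "pepperoni"
    confidence: float  # recognition confidence in the range [0, 1]

    @property
    def uncertainty(self):
        # Assumption: uncertainty is the complement of the confidence.
        return 1.0 - self.confidence


def most_uncertain(elements):
    """Return the element with the greatest associated uncertainty."""
    if len(elements) < 2:
        raise ValueError("at least two information elements are required")
    return max(elements, key=lambda e: e.uncertainty)


elements = [
    InformationElement("size", "large", confidence=0.9),
    InformationElement("topping", "pepperoni", confidence=0.55),
]
print(most_uncertain(elements).name)
```

The name of the element returned here is what a prompt selection step would use to decide which part of the playback to emphasise.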
Thus in particular, in the preferred embodiment, the prompt is selected from a set of two or more possible prompts, each possible prompt containing the same information but having different emphasis. The different emphasis is typically achieved by varying prompt word order, but other parameters can be used instead, such as varying the volume and/or duration and/or pitch of one or more words in the prompt. This selection can generally be achieved by providing multiple pre-recorded prompts, by re-ordering or processing individual audio segments forming a prompt, or by using a text to speech synthesis system with appropriate parameter settings.
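Selection from a set of prompts that differ only in word order might be sketched as below. The template wording, the dictionary PROMPT_TEMPLATES, and the function select_prompt are hypothetical, and the choice to place the emphasised element at the end of the sentence is an assumption made for this example rather than something stated in the text.

```python
# Hypothetical prompt set: each template carries the same two items of
# information, but the word order differs so that a different element
# receives the emphasis. Placing the emphasised element last is an
# assumption made for this sketch.
PROMPT_TEMPLATES = {
    "size": "A {topping} pizza, in size {size}. Is that correct?",
    "topping": "A {size} pizza, with {topping}. Is that correct?",
}


def select_prompt(emphasis, size, topping):
    """Return the prompt variant that emphasises the named element."""
    template = PROMPT_TEMPLATES[emphasis]
    return template.format(size=size, topping=topping)


print(select_prompt("topping", "large", "pepperoni"))
print(select_prompt("size", "large", "pepperoni"))
```

The same selection could equally index a set of pre-recorded audio files or a set of text to speech parameter settings instead of text templates.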
The effect of such different emphasis is to focus user attention on the information element having the greatest uncertainty. This means that he/she is more likely to notice if a mistake has been made in the recognition or identification of the information elements. In addition, he/she is also more likely to stress the incorrect element in any repetition of the input, thereby improving the chances of performing a correct recognition the next time.
Where such repeat input is received, this is
Kunzler and Associates
Schelkopf J. Bruce