Reexamination Certificate
2000-06-13
2002-12-17
Dorvil, Richemond (Department: 2654)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S253000
active
06496799
ABSTRACT:
BACKGROUND OF THE INVENTION
1. The Present Invention
The present invention relates to voice processing apparatus and the like, and more particularly to voice processing systems that use speech recognition.
2. Description of the Related Art
Voice processing systems whereby callers interact over a telephone network (e.g. PSTN or Internet) with computerised equipment are very well-known in the art, and include voice mail systems, voice response units, and so on. Typically such systems ask a caller questions using prompts formed from one or more prerecorded audio segments, and the caller inputs answers by pressing dual tone multiple frequency (DTMF) keys on their telephones. This approach has proved effective for simple interactions, but is clearly restricted in scope due to the limited number of available keys on a telephone. For example, alphabetical input is particularly difficult using DTMF keys.
There has therefore been an increasing tendency in recent years for voice processing systems to use speech recognition in order to augment DTMF input (N.B. the term speech recognition is used herein to denote the act of converting a spoken audio signal into text). The utilisation of speech recognition permits the handling of callers who do not have a DTMF phone, and also the acquisition of more complex information beyond simple numerals from the caller.
As an illustration of the above, WO96/25733 describes a voice response system which includes a prompt unit, a Voice Activity Detector (VAD), and a speech recognition unit. In this system, as a prompt is played to the caller, any input from the caller is passed to the VAD, together with the output from the prompt unit. This allows the VAD to perform echo cancellation on the incoming signal. Then, in response to the detection of voice by the VAD, the prompt is discontinued, and the caller input is switched to the recognition unit, thereby providing a barge-in facility.
Speech recognition in a telephony environment can be supported by a variety of hardware architectures. Many voice processing systems include a special DSP card for running speech recognition software. This card is connected to a line interface unit for the transfer of telephony data by a time division multiplex (TDM) bus. Most commercial voice processing systems, more particularly their line interface units and DSP cards, conform to one of two standard architectures: either the Signal Computing System Architecture (SCSA), or the Multi-vendor Integration Protocol (MVIP). A somewhat different configuration is described in GB 2280820, in which a voice processing system is connected via a local area network to a remote server, which provides a speech recognition facility. This approach is somewhat more complex than the TDM approach, given the data communication and management required, but does offer significantly increased flexibility.
Speech recognition systems are generally used in telephony environments as cost-effective substitutes for human agents, and are adequate for performing simple, routine tasks. It is important that such tasks be performed accurately, otherwise there may be significant caller dissatisfaction, and also as quickly as possible, both to improve caller throughput, and because the owner of the voice processing system is often paying for the call via some free phone mechanism (e.g. an 0800 number), or because an outbound application is involved.
(Note that as used herein, the term “caller” simply indicates the party at the opposite end of a telephone connection to the voice processing system, and does not specify which party actually initiated the telephone connection.)
There has been an increase in recent years in the complexity of input permitted from the caller. This is supported firstly by the use of large vocabulary recognition systems, and secondly by supporting natural language understanding and dialogue management. As a simple example of this, a pizza ordering application several years ago might have gone through a menu to determine the desired pizza size, topping etc., with one prompt to elicit each property of the pizza from a caller. Now, however, such an application may simply ask: “What type of pizza would you like?” The caller response is passed to a large vocabulary continuous speech recognition unit, with the recognised text then being processed in order to extract the relevant information describing the pizza.
The extraction of such information is typically performed by a natural language understanding (NLU) unit working in conjunction with a dialogue manager. These units have knowledge of grammar and syntax, which allows them to parse a caller response such as “I would like a large pizza with pepperoni” to extract the particular information desired by the application, namely that the desired pizza (a) is large, and (b) has a pepperoni topping. The dialogue manager further provides flexibility in terms of generating prompts (perhaps using text-to-speech synthesis) to acquire specific information from a caller.
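The extraction step described above can be illustrated with a minimal sketch. It uses plain keyword matching as a stand-in for an NLU unit, which in practice would apply grammar and syntax knowledge; the function name `extract_pizza_slots` and the `SIZES` and `TOPPINGS` vocabularies are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Optional, Union

# Hypothetical vocabularies, chosen only for this illustration.
SIZES = {"small", "medium", "large"}
TOPPINGS = {"pepperoni", "mushroom", "ham", "olives"}

Slots = Dict[str, Union[Optional[str], List[str]]]


def extract_pizza_slots(recognised_text: str) -> Slots:
    """Pull the pizza size and toppings out of recognised text.

    This is simple keyword spotting, not real parsing: it ignores
    negation ("no pepperoni") and any other grammatical structure.
    """
    size: Optional[str] = None
    toppings: List[str] = []
    for raw in recognised_text.lower().split():
        word = raw.strip(".,!?")  # drop trailing punctuation
        if word in SIZES and size is None:
            size = word  # keep the first size mentioned
        elif word in TOPPINGS and word not in toppings:
            toppings.append(word)
    return {"size": size, "toppings": toppings}


slots = extract_pizza_slots("I would like a large pizza with pepperoni")
```

Here `slots` holds the two properties named in the text: the size `large` and a single `pepperoni` topping. A dialogue manager could then inspect which slots remain empty to decide what to prompt for next.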
The above approach presents a much more natural interface for callers, provides greater flexibility, and potentially can significantly reduce call handling time. However, the increased flexibility also increases the scope for caller confusion. In such cases call efficiency can actually be reduced, and a lost call may result. Prior art voice processing systems have not addressed the problem of caller confusion or uncertainty that is sometimes an inevitable consequence of trying to support a caller interface that is more natural, but at the same time also more complex.
SUMMARY OF THE INVENTION
Accordingly, the invention provides a method of operating a voice processing system comprising the steps of:
receiving spoken input from a user;
performing speech recognition to convert said spoken input into text equivalent;
analysing at least one semantic or prosodic property of said spoken input by looking for task words in the text equivalent of the spoken input; and
responsive to said analysis, determining that the user input has effectively completed if there has not been a task word for more than a predetermined period of time.
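The steps above can be sketched as follows, under stated assumptions. The recogniser is assumed to deliver each word with a timestamp in seconds, the timer is assumed to start at `start_time` and to restart on every task word, and the names `effective_end_time`, `task_words` and `timeout_s` are hypothetical. This is an illustration of the claimed idea, not the implementation described in the patent.

```python
from typing import Iterable, Optional, Set, Tuple


def effective_end_time(
    words: Iterable[Tuple[str, float]],
    task_words: Set[str],
    timeout_s: float,
    start_time: float = 0.0,
) -> Optional[float]:
    """Return the time at which the input is deemed effectively complete.

    `words` is the recognised text as (word, timestamp) pairs in time order.
    The input is treated as effectively complete once more than `timeout_s`
    seconds pass without a task word, even though the caller is still
    speaking. Returns None if that never happens in the words supplied.
    """
    last_task_time = start_time
    for word, timestamp in words:
        if timestamp - last_task_time > timeout_s:
            # The caller is still talking, but nothing task-relevant has
            # been said within the allowed period.
            return last_task_time + timeout_s
        if word.lower() in task_words:
            last_task_time = timestamp  # restart the timer
    return None


recognised = [
    ("um", 0.5), ("large", 1.0), ("pizza", 1.5), ("well", 2.0),
    ("you", 3.0), ("know", 4.0), ("my", 5.0),
]
end = effective_end_time(recognised, {"large", "pizza", "pepperoni"}, 3.0)
```

In this example the last task word arrives at 1.5 seconds, so with a 3 second period `end` is 4.5. The sketch only checks the timer when a word arrives, so it covers a caller who keeps talking without saying anything relevant; detecting that the caller has stopped speaking altogether would need a separate silence check.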
The invention typically finds application in a telephony environment, in which the voice processing system and the user communicate with each other over a telephone network. In this situation, the spoken input is received over a telephone connection, and the voice processing system may itself play out prompts over the telephone connection, such as in response to a determination that the caller input has effectively been completed. The particular prompt played back to the caller in these circumstances may of course be dependent on what information the caller has so far provided to the voice processing system.
Underlying the present invention is the fact that conventional human dialogue is regulated by the concept of turn-taking, with linguistic cues that indicate when one party has finished speaking, and is expecting or inviting the other party to take over. Prior art voice processing systems have not been sensitive to such cues, and so seem extremely artificial in terms of the dialogue that they support. This in turn can cause difficulties for callers trying to use such systems, particularly if they have relatively little experience with such man-machine interfaces.
Unlike prior art systems, the present invention allows a determination of when the caller input has effectively (rather than actually) been completed. In other words, it detects not when the caller has stopped speaking altogether, but rather when the caller has stopped saying anything useful or relevant. This is achieved by analysing at least one semantic or prosodic property of said spoken input. The intention is firstly to assist more quickly callers who are in difficulty (whether or not they are conscious of the fact), and secondly to speed up call handling by interrupting callers who are giving lots of irrelevant information. The naturalness of the caller interface can also be improved by this approach, since the techniques employed mirror to a certain extent
Clay A. Bruce
Dorvil Richemond
Nolan Daniel A.