Speech recognition accuracy in a multimodal input system

Data processing: speech signal processing – linguistics – language – speech signal processing – recognition

Reexamination Certificate


Details

U.S. Classes: C704S236000, C704S239000, C704S255000
Type: Reexamination Certificate
Status: active
Patent number: 06823308

ABSTRACT:

The present invention generally relates to the improvement of the accuracy of speech recognition in a complementary multimodal input system.
Interfaces which use speech as an input together with at least one further modality input are known as multimodal systems. Where two modalities contain the same information content they are termed redundant, e.g. speech recognition and lip movement recognition. Where two modalities each contain their own information they are termed complementary, e.g. speech recognition and eyebrow movement recognition (since, although eyebrow movement can be related to speech, it can carry its own information, e.g. emotion), or speech recognition and pointing events such as mouse clicks. Complementary modality input systems provide a more natural and powerful method of communication than any single modality alone. The further modalities can, for example, comprise pointing events from pointing devices such as a mouse, touch screen, joystick, tracker ball or track pad, a pen input in which handwriting is recognised, or gesture recognition. Thus, in complementary multimodal systems, parallel multimodal inputs are received and processed in order to control a system such as a computer.
It is known that speech recognition engines do not always recognise speech correctly.
It is therefore an object of the present invention to improve the accuracy of speech recognition using the further modality inputs in a complementary multimodal system.
In accordance with a first aspect, the present invention provides a speech recognition method and apparatus for use in a complementary multimodal input system in which digitized speech is received as a first modality input together with data in at least one further, complementary modality. Features in the digitized speech are extracted or identified, as are features in the data of each further modality input. Recognition is then performed by comparing the identified features with states in models for words. The models have states for the recognition of speech and, where words have features in one or more further modalities associated with them, the models for those words also have states for the recognition of the associated events in each further modality. Thus the word models used in recognition utilise not just the features of the first modality input, but also the features of at least one further modality input. This greatly improves recognition accuracy, since more data is available from a different source of information to aid recognition. In particular, the recognition engine will not recognise a word whose model requires further modality inputs if those inputs have not been received in association with the spoken word.
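As a toy illustration of this idea (the words, feature labels and event names below are invented for the sketch, not taken from the patent), a word model can list both its expected speech features and the further-modality events that must accompany the word; a word whose required events were not received is simply not recognised:

```python
# Hypothetical sketch: word models carrying both speech-feature states and
# required further-modality events. A deictic word such as "this" expects a
# pointing event; without one, the recogniser rejects it. All names invented.

WORD_MODELS = {
    # word: (expected speech feature sequence, required further-modality events)
    "this":  (["dh", "ih", "s"], ["click"]),   # deictic word: needs a click
    "hello": (["hh", "eh", "l", "ow"], []),    # ordinary word: speech only
}

def recognise(speech_features, modality_events):
    """Return the words whose speech features match AND whose required
    further-modality events were actually received alongside the speech."""
    matches = []
    for word, (states, required) in WORD_MODELS.items():
        if speech_features == states and all(e in modality_events for e in required):
            matches.append(word)
    return matches
```

Here `recognise(["dh", "ih", "s"], [])` returns no match, while the same speech features accompanied by a `"click"` event yield `"this"`, mirroring the claim that words requiring further modality inputs are only recognised when those inputs arrive.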
This invention is applicable to a complementary multimodal input system in which the improved speech recognition technique is used for the input of recognised words, and inputs to a processing system are generated by processing the recognised words and data from at least one further modality input in accordance with multimodal grammar rules. Thus in this aspect of the present invention, a more accurate input is achieved to the processing system.
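A minimal sketch of such a multimodal grammar rule, assuming an invented set of deictic words and click coordinates (the rule and names are illustrative, not prescribed by the patent): each deictic word in the recognised word stream is bound to the next pointing event to form an input for the processing system.

```python
# Hypothetical multimodal grammar rule: bind deictic words in the recognised
# word sequence to pointing-event coordinates, producing a combined command.

def apply_grammar(words, clicks):
    """Pair each deictic word with the next click coordinate; other words
    carry no pointing data. Returns a list of (word, coordinate) pairs."""
    deictics = {"this", "that", "here", "there"}
    click_iter = iter(clicks)
    command = []
    for w in words:
        if w in deictics:
            command.append((w, next(click_iter)))  # consume one pointing event
        else:
            command.append((w, None))
    return command
```

For example, the utterance "move this there" with two mouse clicks resolves to a command in which "this" and "there" each carry a coordinate.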
In one embodiment, the models comprise an array of states having a dimensionality equal to the number of modes in the received multimodal input. Recognition then preferably takes place by transiting sequentially between states in a first dimension upon receipt of a feature in the speech input, and transiting along the states in each further dimension upon receipt of the appropriate feature in the corresponding further modality input. Thus, in one embodiment, the word models use states of hidden Markov models for speech and states of finite state machines for each further modality input. In one embodiment the transitions between states have probabilities associated with them, resulting in an accumulated probability during the recognition process; in accordance with conventional speech recognition, the word recognised is the one with the highest accumulated probability at a final state in its word model.
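One way to sketch this two-dimensional model in code, under simplifying assumptions not taken from the patent (each speech state emits exactly once with an invented emission probability, and the further-modality finite state machine simply consumes matching events): speech features advance along dimension 0, modality events along dimension 1, and a word's score is the accumulated log-probability at the final state of both dimensions.

```python
import math

# Hypothetical sketch of a 2-D word model: rows are HMM-like speech states,
# columns are finite-state-machine states for one further modality. A word is
# scored only if both dimensions reach their final state; otherwise None.

def score_word(model, inputs):
    """model: (speech_states, modality_states, emission_probs).
    inputs: sequence of ('speech', feature) or ('event', token) pairs.
    Returns accumulated log-probability at the final state, else None."""
    speech_states, modality_states, emit = model
    i = j = 0          # position in the speech / further-modality dimension
    logp = 0.0
    for mode, token in inputs:
        if mode == "speech":
            if i >= len(speech_states):
                return None
            p = emit.get((speech_states[i], token), 0.0)
            if p == 0.0:
                return None
            logp += math.log(p)   # transit along the speech dimension
            i += 1
        else:
            if j >= len(modality_states) or modality_states[j] != token:
                return None
            j += 1                # transit along the further-modality dimension
    # a word is recognised only at the final state of the whole array
    return logp if (i == len(speech_states) and j == len(modality_states)) else None
```

Scoring every word model this way and taking the highest accumulated probability gives the conventional best-path decision described above; a word whose modality dimension never reaches its final state (e.g. a missing click) is excluded outright.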
The present invention can be implemented in dedicated, specifically designed hardware. More preferably, however, the present invention is implemented using a general purpose computer controlled by software. Thus the present invention encompasses program code for controlling a processor to implement the technique, and can be embodied as a carrier medium carrying the program code. Such a carrier medium can, for example, comprise a storage medium such as a floppy disk, CD-ROM, hard disk drive, or programmable read only memory device, or the carrier medium can comprise a signal such as an electrical signal carried over a network such as the Internet.


REFERENCES:
patent: 4876720 (1989-10-01), Kaneko et al.
patent: 5689616 (1997-11-01), Li
patent: 5748974 (1998-05-01), Johnson
patent: 5781179 (1998-07-01), Nakajima et al.
patent: 5895464 (1999-04-01), Bhandari et al.
patent: 5960395 (1999-09-01), Tzirkel-Hancock
patent: 6115683 (2000-09-01), Burstein et al.
patent: 6411925 (2002-06-01), Keiller
patent: 2002/0111798 (2002-08-01), Huang
patent: 08-095734 (1996-12-01), None
patent: WO 99/46763 (1999-09-01), None
patent: WO 00/08547 (2000-02-01), None
Johnston et al (“Unification-Based Multimodal Integration”, Assoc. Computational Linguistics, 1997).*
Mellish, et al., “Techniques in Natural Language Processing 1” (1994).
Allen, et al., “Chart Parsing”, Chapter 3, “Natural Language Understanding” (1987 (1st publishing) & 1994 (2nd publishing)).
U.S. patent application Ser. No. 09/409,250, Keiller, filed Jun. 2002.*
U.S. patent application Ser. No. 09/409,249, Keiller, filed Sep. 1999.*
U.S. patent application Ser. No. 09/669,510, Fortescue et al., filed Oct. 2000.*
U.S. patent application Ser. No. 09/652,932, Keiller, filed May 2000.*
U.S. patent application Ser. No. 09/551,909, Rees et al., filed Apr. 2000.*
WordNet® (“Princeton University Cognitive Science Laboratory Internet page”, http://www.cogsci.princeton.edu), Jan. 2001.*
“Unification-Based Multimodal Integration”, Johnston, et al., in Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, 1997, pp. 281-288.
“Research in Multimedia and Multimodal Parsing and Generation”, Mark T. Maybury, Conference Proceedings of Coling 94, Computational Linguistics, 15th International Conference, Aug. 1994, Kyoto, Japan.
“Put That Where? Voice and Gesture at the Graphics Interface” M. Billinghurst—ACM SIGGRAPH vol. 32, No. 4, Nov. 1998.
“Pearl: A Probabilistic Chart Parser” Magerman et al in Proceedings, European ACL, Apr. 1991. Also published in Proceedings, Second International Workshop for Parsing Technologies, Feb. 1991.
“A Generic Platform for Addressing the Multimodal Challenge” Laurence Nigay, et al.—Conference proceedings on Human Factors in computing systems, May 7-11, 1995 Denver, CO USA.
