Speaker model adaptation via network of similar users

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate


Details

C704S270100

Reexamination Certificate

active

06442519

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is related to speech recognition and more particularly to speech recognition on multiple connected computer systems connected together over a network.
2. Background Description
Automatic speech recognition (ASR) systems for voice dictation and the like use any of several well-known approaches for word recognition.
For example, L. R. Bahl, P. V. de Souza, P. S. Gopalakrishnan, D. Nahamoo, and M. Picheny, "Robust Methods for Using Context-dependent Features and Models in Continuous Speech Recognizer," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, vol. I, pp. 533-36, Adelaide, 1994, describe an acoustic ranking method useful for speech recognition. Acoustic decision trees, also useful for speech recognition, are described by L. R. Bahl, P. V. de Souza, P. S. Gopalakrishnan, D. Nahamoo, and M. Picheny, in "Decision Trees for Phonological Rules in Continuous Speech," Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing, Toronto, Canada, May 1991. Frederick Jelinek, in Statistical Methods for Speech Recognition, The MIT Press, Cambridge, January 1999, describes identifying parameters that control the decoding process.
While generally recognizing spoken words with a relatively high degree of accuracy, especially in a single-user system, these prior speech recognition systems still frequently make inappropriate recognition errors. Generally, for single-user systems, these errors can be reduced with additional user-specific training. However, the additional training time and the increased data volume that must be handled during training are undesirable. So, for expediency, recognition accuracy is traded off to minimize training time and data.
Speaker independent automatic speech recognition systems, such as what are normally referred to as interactive voice response systems, have a different set of problems, because they are intended to recognize speech from a wide variety of individual speakers. Typically, the approach with speaker independent ASR systems is to improve recognition accuracy by assigning individual speakers or recognition system users to user clusters. User clusters are groups of users with similar speech characteristics or patterns. As each speaker or user uses the system, the speaker is identified as belonging to one cluster. For each user cluster, acoustic prototypes are developed and are used for speech decoding.
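The clustering step described above, assigning each speaker to a group of acoustically similar users, can be sketched as follows. This is only an illustrative outline, not the patent's actual method: the per-speaker feature vectors and the use of a naive k-means procedure are assumptions.

```python
import math
import random

def assign_to_cluster(speaker_vec, centroids):
    """Return the index of the nearest cluster centroid (Euclidean distance)."""
    dists = [math.dist(speaker_vec, c) for c in centroids]
    return dists.index(min(dists))

def cluster_speakers(speaker_vecs, k, iters=20, seed=0):
    """Naive k-means over per-speaker feature vectors (e.g., mean cepstra)."""
    rng = random.Random(seed)
    centroids = rng.sample(speaker_vecs, k)  # pick k distinct speakers as seeds
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in speaker_vecs:
            groups[assign_to_cluster(v, centroids)].append(v)
        for i, g in enumerate(groups):
            if g:  # recompute centroid as the mean of its members
                centroids[i] = [sum(dim) / len(g) for dim in zip(*g)]
    return centroids
```

Once the centroids are fixed, a new speaker is assigned to one cluster with `assign_to_cluster`, and that cluster's acoustic prototypes are used for decoding.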
For example, speakers may be clustered according to language or accent. Various techniques for language identification are taught by D. Matrouf, M. Adda-Decker, L. Lamel and J. Gauvain, in "Language Identification Incorporating Lexical Information," Proceedings of the 1998 International Conference on Spoken Language Processing (ICSLP 98), Sydney, Australia, December 1998. A well-known method of determining an accent from acoustic features is taught by M. Lincoln, S. Cox and S. Ringland, in "A Comparison of Two Unsupervised Approaches to Accent Identification," Proceedings of the 1998 International Conference on Spoken Language Processing (ICSLP 98), Sydney, Australia, December 1998. However, with the approach of Lincoln et al., if there is very large speaker variability, as is normally the case, that variability may not be accounted for in training. Accordingly, the speaker clusters that are accumulated in a normal ASR training period generally do not provide for all potential ASR users.
Consequently, to provide some improvement over speaker-dependent methods, ASR decoding approaches are used that are based on various adaptation schemes for acoustic models. These recognition adaptation schemes use additional data that the ASR system gathers, subsequent to training, every time a user dictates to the system. The speaker or user usually corrects any errors in the recognition result interactively, and those corrected scripts are used for what is normally referred to as supervised adaptation.
See, for example, Jerome R. Bellegarda, "Context-dependent Vector Clustering for Speech Recognition," in Automatic Speech and Speaker Recognition, edited by Chin-Hui Lee and Frank K. Soong, Kluwer Academic Publishers, Boston, 1996, pp. 133-153, which teaches an adaptation of acoustic prototypes in response to subsequent speech data collected from other users. Also, M. J. F. Gales and P. C. Woodland, "Mean and Variance Adaptation within the MLLR Framework," Computer Speech and Language (1996) 10, 249-264, teach incremental adaptation of HMM parameters derived from speech data from additional subsequent users.
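The MLLR framework cited above adapts each Gaussian mean of the acoustic model with a shared affine transform, mu' = A*mu + b, estimated from the adaptation data. A minimal sketch of applying such a transform is shown below; estimating A and b is omitted, and the two-dimensional means are purely illustrative.

```python
def adapt_mean(mean, A, b):
    """Apply an MLLR-style affine transform to one Gaussian mean: mu' = A @ mu + b."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
            for row, b_i in zip(A, b)]

def adapt_model(means, A, b):
    """Transform every mean in the model with the same (A, b) regression class."""
    return [adapt_mean(m, A, b) for m in means]
```

Because one (A, b) pair is shared across many Gaussians (a regression class), even a small amount of adaptation data can shift the whole model toward a new speaker.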
The drawback with the above approaches of Bellegarda or Gales et al. is that during typical dictation sessions the user uses a relatively small number of phrases. So, it may take several user sessions to gather sufficient acoustic data to show any significant recognition accuracy improvement using such a supervised adaptation procedure. As might be expected, in the initial sessions the decoding accuracy may be very low, requiring significant interactive error correction.
Further, similar or even worse problems arise in unsupervised ASR applications, where users do not correct ASR output. For example, unsupervised ASR is used in voice response systems wherein each user calls in to a service that uses ASR to process user voice input. C. H. Lee and J. L. Gauvain, "Bayesian Adaptive Learning and MAP Estimation of HMM," in Automatic Speech and Speaker Recognition, edited by Chin-Hui Lee and Frank K. Soong, Kluwer Academic Publishers, Boston, 1996, pp. 109-132, describe supervised and unsupervised acoustic model adaptation methods. While it is still possible to adapt speech recognition for any new user using unsupervised adaptation, sufficient data must be collected prior to unsupervised use to ensure adequate decoding accuracy for every new user.
Thus, there is a need for increasing the amount of usable acoustic data that are available for speech recognition of individual speakers in supervised and unsupervised speech recognition sessions.
SUMMARY OF THE INVENTION
It is a purpose of the invention to improve speech recognition by computers.
It is another purpose of the invention to expand the data available for speech recognition.
The present invention is a speech recognition system, method and program product for recognizing speech input from computer users connected together over a network of computers, each computer including at least one user-based acoustic model trained for a particular user. Computer users on the network are clustered into classes of similar users according to their similarities, including characteristics such as nationality, profession, sex and age. Characteristics of users are collected from databases over the network and from users using the speech recognition system, and are distributed over the network during or after user activities. As recognition progresses, similar language models among similar users are identified on the network. The acoustic models include an acoustic model domain, with similar acoustic models being clustered according to an identified domain. Existing acoustic models are modified in response to user production activities. Update information, including information about user activities and user acoustic model data, is transmitted over the network. Acoustic models improve for users that are connected over the network as similar users use their respective voice recognition systems.
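As one hedged illustration of the network-sharing idea in this summary (not the claimed implementation), a user's acoustic model could be nudged toward statistics pooled from similar users in the same cluster. The per-mean representation and the interpolation weight below are assumptions made for the sketch.

```python
def pool_cluster_means(user_models):
    """Average corresponding Gaussian means across similar users' models."""
    n = len(user_models)
    return [[sum(model[i][d] for model in user_models) / n
             for d in range(len(user_models[0][i]))]
            for i in range(len(user_models[0]))]

def adapt_toward_cluster(own_means, cluster_means, weight=0.3):
    """Interpolate a user's means toward the cluster consensus.

    weight=0 keeps the user's model unchanged; weight=1 replaces it
    with the pooled cluster statistics.
    """
    return [[(1 - weight) * o + weight * c for o, c in zip(om, cm)]
            for om, cm in zip(own_means, cluster_means)]
```

In the networked setting the summary describes, each client would periodically transmit its update information, receive the pooled cluster statistics, and interpolate, so every user benefits from data dictated by similar users.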


REFERENCES:
patent: 5664058 (1997-09-01), Vysotsky
patent: 5864807 (1999-01-01), Campbell et al.
patent: 5895447 (1999-04-01), Ittycheriah et al.
patent: 5897616 (1999-04-01), Kanevsky et al.
patent: 5950157 (1999-09-01), Heck et al.
patent: 6088669 (2000-07-01), Maes
patent: 6141641 (2000-10-01), Hwang et al.
patent: 6163769 (2000-12-01), Acero et al.
patent: 6182037 (2001-01-01), Maes
patent: 6182038 (2001-01-01), Balakrishnan et al.
patent: 6253179 (2001-06-01), Beigi et al.
patent: 6327568 (2001-12-01), Joost
patent: 6363348 (2002-03-01), Besling et al.
L.R. Bahl, P.V. de Souza, P.S. G
