Context-dependent acoustic models for medium and large...

Data processing: speech signal processing, linguistics, language – Speech signal processing – Recognition


Details

U.S. Classes: C704S255000, C704S203000
Type: Reexamination Certificate
Status: active
Patent Number: 06571208

ABSTRACT:

BACKGROUND AND SUMMARY OF THE INVENTION
Small vocabulary speech recognition systems have as their basic units the words in the small vocabulary to be recognized. For instance, a system for recognizing the English alphabet will typically have 26 models, one model per letter of the alphabet. This approach is impractical for medium and large vocabulary speech recognition systems. These larger systems typically take as their basic units the phonemes or syllables of a language. If a system contains one model (e.g., one Hidden Markov Model) per phoneme of a language, it is called a system with “context-independent” acoustic models.
If a system employs different models for a given phoneme, depending on the identity of the surrounding phonemes, the system is said to employ “context-dependent” acoustic models. An allophone is a specialized version of a phoneme defined by its context. For instance, all the instances of ‘ae’ pronounced before ‘t’, as in “bat,” “fat,” etc. define an allophone of ‘ae’.
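To make the notion concrete, the short Python sketch below names an allophone by its immediate phonetic context. The triphone-style “b-ae+t” notation is a common convention in the speech recognition literature, not something prescribed by this patent, and the function name is illustrative only.

def allophone_label(left: str, phoneme: str, right: str) -> str:
    """Name an allophone of `phoneme` by its left and right neighbors."""
    return f"{left}-{phoneme}+{right}"

# All instances of 'ae' pronounced after 'b' and before 't', as in "bat",
# share one context-dependent label (and hence one acoustic model):
print(allophone_label("b", "ae", "t"))  # b-ae+t
print(allophone_label("f", "ae", "t"))  # f-ae+t (as in "fat")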
For most languages, the acoustic realization of a phoneme depends very strongly on the preceding and following phonemes. For instance, an ‘eh’ preceded by a ‘y’ (as in “yes”) is quite different from an ‘eh’ preceded by ‘s’ (as in “set”). Thus, for a system with a medium-sized or large vocabulary, the performance of context-dependent acoustic models is much better than that of context-independent models. Most practical applications of medium and large vocabulary recognition systems today employ context-dependent acoustic models.
Many context-dependent recognition systems today employ decision tree clustering to define the context-dependent, speaker-independent acoustic models. A tree-growing algorithm finds questions about the phonemes surrounding the phoneme of interest and splits apart acoustically dissimilar examples of the phoneme of interest. The result is a decision tree of yes-no questions for selecting the acoustic model that will best recognize a given allophone. Typically, the yes-no questions pertain to how the allophone appears in context (i.e., what its neighboring phonemes are).
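As a rough sketch of one such tree-growing step, the Python fragment below scores candidate context questions by how much a yes/no split reduces the total spread of the acoustic examples. The passage above does not fix a particular split criterion; variance reduction (a single-Gaussian likelihood proxy common in the phonetic decision-tree literature) is assumed here, and all names and toy data are illustrative.

import numpy as np

def sse(vectors: np.ndarray) -> float:
    """Sum of squared deviations from the mean (lower = more homogeneous)."""
    if len(vectors) == 0:
        return 0.0
    return float(((vectors - vectors.mean(axis=0)) ** 2).sum())

def best_question(examples, questions):
    """Pick the question whose yes/no split most reduces total spread."""
    feats = np.array([f for _, f in examples])
    base = sse(feats)
    best = None
    for name, q in questions:
        yes = np.array([f for c, f in examples if q(c)])
        no = np.array([f for c, f in examples if not q(c)])
        gain = base - (sse(yes) + sse(no))
        if best is None or gain > best[1]:
            best = (name, gain)
    return best

# Toy 2-D "acoustic" vectors for the phoneme 'eh' in two contexts:
examples = [
    ({"left": "y"}, np.array([1.0, 1.1])),  # 'eh' as in "yes"
    ({"left": "y"}, np.array([1.1, 0.9])),
    ({"left": "s"}, np.array([3.0, 3.2])),  # 'eh' as in "set"
    ({"left": "s"}, np.array([2.9, 3.1])),
]
questions = [("left phoneme is 'y'?", lambda c: c["left"] == "y")]
print(best_question(examples, questions))  # the 'y' question, with a large gain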
The conventional decision tree defines for each phoneme a binary tree containing yes-no questions in the root node and in each intermediate node (children, grandchildren, etc. of the root node). The terminal nodes, or leaf nodes, contain the acoustic models designed for particular allophones of the phoneme. Thus, in use, the recognition system traverses the tree, branching ‘yes’ or ‘no’ based on the context of the phoneme in question until the leaf node containing the applicable model is identified. Thereafter the identified model is used for recognition.
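A minimal Python sketch of this traversal follows; the node layout and question encoding are assumptions for illustration, not taken from the patent.

class Node:
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question = question  # predicate over the phonetic context
        self.yes, self.no = yes, no
        self.model = model        # acoustic model stored at a leaf node

def select_model(node, context):
    """Branch yes/no on the context until a leaf model is reached."""
    while node.model is None:
        node = node.yes if node.question(context) else node.no
    return node.model

# Toy tree for phoneme 'eh': first ask about the left neighbor.
tree = Node(
    question=lambda c: c["left"] == "y",
    yes=Node(model="eh_after_y_model"),
    no=Node(model="eh_generic_model"),
)
print(select_model(tree, {"left": "y", "right": "s"}))  # eh_after_y_model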
Unfortunately, conventional allophone modeling can go wrong. We believe this is because current methods do not take into account the particular idiosyncrasies of each training speaker. Current methods assume that individual speaker idiosyncrasies will be averaged out if a large pool of training speakers is used. However, in practice, we have found that this assumption does not always hold. Conventional decision tree-based allophone models work fairly well when a new speaker's speech happens to resemble the speech of the training speaker population. However, conventional techniques break down when the new speaker's speech lies outside the domain of the training speaker population.
The present invention addresses the foregoing problem through a reduced dimensionality speaker space assessment technique that allows individual speaker idiosyncrasies to be rapidly identified and removed from the recognition equation, resulting in allophone models that are far more universally applicable and robust. The reduced dimensionality speaker space assessment is performed in a reduced dimensionality space that we call the eigenvoice space, or eigenspace. One of the important advantages of our eigenvoice technique is speed. When a new speaker uses the recognizer, his or her speech is rapidly placed or projected into the eigenspace derived from the training speaker population. Even the very first utterance by the new speaker can be used to place the new speaker into eigenspace. In eigenspace, the allophones may be represented with minimal influence by irrelevant factors such as each speaker's position in speaker space.
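The NumPy sketch below conveys the geometry of this placement: it builds an eigenspace by plain PCA over training-speaker supervectors and places a new speaker by linear projection. The patent's own placement technique may differ (e.g., a maximum-likelihood estimator rather than direct projection), and all dimensions and data here are made up, so treat this only as an assumption-laden illustration.

import numpy as np

rng = np.random.default_rng(0)

# Each row: one training speaker's supervector (all model means concatenated).
T, D, K = 20, 300, 5          # speakers, supervector dim, eigenvoices kept
supervectors = rng.normal(size=(T, D))

mean_voice = supervectors.mean(axis=0)
_, _, vt = np.linalg.svd(supervectors - mean_voice, full_matrices=False)
eigenvoices = vt[:K]          # K x D basis spanning the eigenspace

def place_in_eigenspace(new_supervector: np.ndarray) -> np.ndarray:
    """Coordinates of a new speaker in the K-dimensional eigenspace."""
    return eigenvoices @ (new_supervector - mean_voice)

# Even a single utterance yields a (rough) supervector estimate to project:
w = place_in_eigenspace(rng.normal(size=D))
adapted_means = mean_voice + eigenvoices.T @ w  # speaker-adapted supervector
print(w.shape, adapted_means.shape)             # (5,) (300,)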
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings. In the following detailed description, two basic embodiments are illustrated. Different variations of these embodiments are envisioned as will be appreciated by those skilled in this art.


REFERENCES:
patent: 4718088 (1988-01-01), Baker et al.
patent: 4817156 (1989-03-01), Bahl et al.
patent: 4829577 (1989-05-01), Kuroda et al.
patent: 4903035 (1990-02-01), Kropielnicki et al.
patent: 5046099 (1991-09-01), Nishimura
patent: 5050215 (1991-09-01), Nishimura
patent: 5127055 (1992-06-01), Larkey
patent: 5150449 (1992-09-01), Yoshida et al.
patent: 5170432 (1992-12-01), Hackbarth et al.
patent: 5233681 (1993-08-01), Bahl et al.
patent: 5280562 (1994-01-01), Bahl et al.
patent: 5293584 (1994-03-01), Brown et al.
patent: 5375173 (1994-12-01), Sanada et al.
patent: 5473728 (1995-12-01), Luginbuhl et al.
patent: 5522011 (1996-05-01), Epstein et al.
patent: 5579436 (1996-11-01), Chou et al.
patent: 5617486 (1997-04-01), Chow et al.
patent: 5651094 (1997-07-01), Takagi et al.
patent: 5664059 (1997-09-01), Zhao
patent: 5737723 (1998-04-01), Riley et al.
patent: 5778342 (1998-07-01), Erell et al.
patent: 5787394 (1998-07-01), Bahl et al.
patent: 5793891 (1998-08-01), Takahashi et al.
patent: 5794192 (1998-08-01), Zhao
patent: 5806029 (1998-09-01), Buhrke et al.
patent: 5812975 (1998-09-01), Komori et al.
patent: 5825978 (1998-10-01), Digalakis et al.
patent: 5839105 (1998-11-01), Ostendorf et al.
patent: 5842163 (1998-11-01), Weintraub
patent: 5864810 (1999-01-01), Digalakis
patent: 5890114 (1999-03-01), Yi
patent: 5895447 (1999-04-01), Ittycheriah et al.
patent: 6016471 (2000-01-01), Kuhn et al.
patent: 6029132 (2000-02-01), Kuhn et al.
patent: 6138087 (2000-10-01), Budzinski
patent: 6163769 (2000-12-01), Acero et al.
patent: 6230131 (2001-05-01), Kuhn et al.
patent: 6233553 (2001-05-01), Contolini et al.
patent: 6263309 (2001-07-01), Nguyen et al.
patent: 6324512 (2001-11-01), Junqua et al.
patent: 6343267 (2002-01-01), Kuhn et al.
Kuhn et al., “Improved Decision Trees for Phonetic Modeling,” International Conference on Acoustics, Speech, and Signal Processing, pp. 552-555, May 1995.*
Lazarides et al., “Improving Decision Trees for Acoustic Modeling,” 4th International Conference on Spoken Language Processing, pp. 1053-1056, Oct. 1996.*
Abe et al., “Hierarchical Clustering of Parametric Data with Application to the Parametric Eigenspace Method,” International Conference on Image Processing, pp. 118-122, Oct. 1999.*
V. Digalakis, et al., Rapid speech recognizer adaptation to new speakers, Tech. Univ. of Crete, Chania, Greece, pp. 765-768, vol. 2, Mar. 1999.
S.J. Cox, et al., Simultaneous speaker normalisation and utterance labelling using Bayesian neural net techniques, British Telecom Res. Lab., Ipswich, UK, pp. 161-164, vol. 1, Apr. 1990.
Yunxin Zhao, An acoustic-phonetic-based speaker adaptation technique for improving speaker-independent continuous speech recognition, Speech Technol. Lab., Panasonic Technol. Inc., Santa Barbara, CA, USA, pp. 380-394, vol. 2, Jul. 1994.
V. Abrash et al., Acoustic adaptation using nonlinear transformations of HMM parameters, Speech Res. & Technol. Lab., SRI Int., Menlo Park, CA, USA, pp. 729-732, vol. 2, May 1996.
R. Kuhn, et al., Eigenfaces and eigenvoices: dimensionality reduction for specialized pattern recognition, Panasonic Technol.-STL, Santa Barbara, CA, USA, pp. 71-76, Dec. 1998.
J.-L. Gauvain, et al., Improved acoustic modeling with Bayesian learning, AT&T Bell Labs., Murray Hill, NJ, USA, pp. 481-484, vol. 1, Mar. 1992.
Ming-Whei Feng, Speaker Adaptation Based on Spectral Normalization and Dynamic HMM Parameter Adaptation, GTE Laboratories Inc., IEEE, 1995, pp. 704-707.
J. M
