Reexamination Certificate
1998-12-09
2002-01-29
Smits, Talivaldis I. (Department: 2641)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S235000, C704S260000
active
06343270
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates to speech processing systems and, more particularly, to a method and device for increasing the dialect precision and usability in speech recognition and text-to-speech systems.
2. Discussion of Related Prior Art
Generally, in a speech recognition system, each word of a vocabulary to be recognized is represented by a baseform wherein a word is divided for recognition purposes into a structure of phones, i.e. phonetic elements, as shown in FIG. 1. See also F. Jelinek, “Continuous Speech Recognition by Statistical Methods”, Proceedings IEEE, Vol. 64, 1976, pp. 532-576, incorporated by reference herein.
These phones correspond generally to the sounds of vowels and consonants as are commonly used in phonetic alphabets. In actual speech, a portion of a word may have different pronunciations, as indicated in FIG. 2. FIG. 2 illustrates a freely choosable pronunciation alternative, with the first phone of the word having two pronunciation alternatives.
A typical speech recognition system would store a separate and distinct linear baseform representation for each pronunciation alternative, where each representation consists of a unique linear combination of phones or phonemes. For the “economics” exemplar, the speech recognition system would store two separate linear strings, as illustrated at FIG. 2.
In addition to freely choosable pronunciation variations, typical speech recognition systems also store dialectal alternatives in a similar manner. FIG. 3 illustrates a dialectal alternative for the exemplar “economics”, showing both a New York City area and a Canadian pronunciation. FIG. 3 illustrates two dialectal alternatives; however, any number of dialectal variations may be considered by the method. FIG. 3 illustrates a dialectal variation at the fifth phone of the word. A typical speech recognition system would be required to store four separate linear baseform representations for the exemplar “economics” to account for a single freely choosable pronunciation alternative and a single dialectal alternative.
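The count of four follows from multiplying the number of alternatives at each variation site (two at the first phone, two at the fifth). A minimal Python sketch of that enumeration is given here for illustration only; the phone symbols are hypothetical placeholders, since the figures' actual transcriptions are not reproduced in this text.

```python
from itertools import product

# Hypothetical phone symbols for "economics" (assumed for illustration;
# not the transcription used in FIG. 2 or FIG. 3).
# Each slot lists every alternative allowed at that position in the word.
slots = [
    ["IY", "EH"],  # first phone: freely choosable alternative
    ["K"],
    ["AX"],
    ["N"],
    ["AA", "AO"],  # fifth phone: dialectal alternative
    ["M"],
    ["IH"],
    ["K"],
    ["S"],
]

# A conventional lexicon stores one linear string per combination.
linear_baseforms = [" ".join(phones) for phones in product(*slots)]

for baseform in linear_baseforms:
    print(baseform)
print(len(linear_baseforms))  # 2 alternatives x 2 alternatives = 4
```

Each additional variation site multiplies the number of stored strings again, which is how a single word can reach the dozens of variant baseforms discussed below.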
For certain applications, storing each of the baseform representations of a word is acceptable; in the general case, however, it can lead to problems. If, for example, a system designer discovers after the initial construction stage that additional variation must be considered, editing the pronunciation lexicon can become tedious and error-prone because each change must be made manually. Another drawback of storing every conceivable baseform representation of a word or phrase arises in real-time applications, where a primary objective of the speech recognition system is to minimize the error rate. The common element in such real-time applications is that the speech recognition system is not afforded the luxury of enrolling the speaker (i.e. determining his or her speech characteristics in a sample session). Typical real-time applications may include, for example, a person walking up to a kiosk in a mall or subscribing over the telephone. When all of the possible baseform representations are pre-stored in the lexicon, the speech recognition system becomes more error-prone, because it faces a greater number of choices and has no capacity to develop a characterization model of an individual to weight one pronunciation and/or dialect over another.
Accordingly, it would be desirable to provide a method and device for reducing the size of the pronunciation lexicon by storing only the reasonable pronunciations for a particular dialect or set of dialects. It is also desirable to eliminate errors inherent in manually inputting one or more variant baseforms, where such variations can be on the order of fifty or more in certain applications. Further, it is also desirable to reduce the cost and drudgery associated with the manual input of changes to the pronunciation lexicon.
SUMMARY OF THE INVENTION
In accordance with the present invention, a method for increasing both dialect precision and usability in speech recognition and text-to-speech systems is described. The invention generates non-linear (i.e. encoded) baseform representations for words and phrases from a pronunciation lexicon. The baseform representations are encoded to incorporate both pronunciation variations and dialectal variations. The encoded baseform representations may be later expanded (i.e. decoded) into one or more linear dialect specific baseform representations, utilizing a set of dialect specific phonological rules. The method provides the additional capability for a user specified dialect independent mode, whereby all encoded baseform variations will be included as part of the decoded output lexicon.
According to an illustrative embodiment, words and phrases from a pronunciation lexicon are encoded for both pronunciation and dialectal variations. A single encoded (i.e. non-linear) baseform representation will be stored for each word or phrase that contains a pronunciation and/or dialectal variation. Note that not all words and phrases will contain such variations, and as such they will be stored unencoded as linear baseform representations. Special encoding symbols are used to encode the variations. The encoded baseform representations may be later decoded (i.e. expanded) any number of times as needed into linear output baseform representations that are either dialect specific or dialect independent, depending upon a user specified dialect preference.
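The storage scheme in this paragraph can be sketched in Python as follows. This summary does not specify the special encoding symbols, so the brace and angle-bracket notation, the dialect tags NYC and CAN, the phone symbols, and the function names are all assumptions invented for illustration.

```python
# Hypothetical notation (an assumption, not the patent's encoding symbols):
#   {A|B}          freely choosable pronunciation alternatives
#   <NYC:A|CAN:B>  dialect-tagged alternatives


def parse_baseform(text):
    """Parse one lexicon entry into a list of (kind, value) slots.

    A plain phone becomes ("phone", "K"), a free variation becomes
    ("free", ["IY", "EH"]), and a dialectal variation becomes
    ("dialect", {"NYC": "AA", "CAN": "AO"}).
    """
    slots = []
    for token in text.split():
        if token.startswith("{") and token.endswith("}"):
            slots.append(("free", token[1:-1].split("|")))
        elif token.startswith("<") and token.endswith(">"):
            pairs = (item.split(":", 1) for item in token[1:-1].split("|"))
            slots.append(("dialect", {dialect: phone for dialect, phone in pairs}))
        else:
            slots.append(("phone", token))
    return slots


def is_encoded(slots):
    """Return True when the entry contains at least one variation slot."""
    return any(kind != "phone" for kind, _ in slots)


# One stored entry per word: encoded when variations exist, linear otherwise.
lexicon = {
    "economics": parse_baseform("{IY|EH} K AX N <NYC:AA|CAN:AO> M IH K S"),
    "cat": parse_baseform("K AE T"),
}

for word, slots in lexicon.items():
    print(word, "encoded" if is_encoded(slots) else "linear")
```

In this sketch a word with variations occupies a single entry no matter how many linear pronunciations it can later expand into, so a correction is made in one place.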
In accordance with an embodiment of the present invention, a computer based pronunciation lexicon generation system is formed with a first data file comprised of an encoded lexicon of non-linear baseforms and a second data file having one or more sets of dialect specific phonological rules. The system further includes a computer processor which is operatively coupled to the first and second data files and generates a third output data file therefrom. The output data file is a decoded pronunciation lexicon comprised of a plurality of linear (i.e. decoded) baseform representations. The output data file is generated by the processor which applies dialect specific phonological rules from the second data file to encoded baseform representations in the first data file. In the case where a user does not specify a preferred dialect, all of the phonological rules from the rule set database will be used to decode the first data file.
In one aspect of the invention, a method for generating a dialect specific pronunciation lexicon from an encoded pronunciation lexicon comprises the steps of: constructing an encoded pronunciation lexicon having a plurality of encoded and unencoded baseforms; inputting one or more user specified dialects; selecting dialect specific phonological rules from a rule set database; and decoding the encoded pronunciation lexicon using the dialect specific phonological rules to yield a dialect specific decoded pronunciation lexicon.
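The four steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the encoding symbols "$E1" and "$O5", the dialect names, the phone symbols, and the modeling of a phonological rule as a mapping from an encoding symbol to its permitted phones are all hypothetical.

```python
from itertools import product

# Step 1: an encoded lexicon holding encoded and unencoded baseforms.
# "$E1" and "$O5" are hypothetical encoding symbols.
ENCODED_LEXICON = {
    "economics": ["$E1", "K", "AX", "N", "$O5", "M", "IH", "K", "S"],
    "cat": ["K", "AE", "T"],  # no variation, stored as a linear baseform
}

# Rule set database (hypothetical): for each dialect, every encoding
# symbol maps to the phones that dialect permits at that site.
RULE_SETS = {
    "NYC": {"$E1": ["IY", "EH"], "$O5": ["AA"]},
    "CAN": {"$E1": ["IY", "EH"], "$O5": ["AO"]},
}


def select_rules(dialects=None):
    """Steps 2 and 3: merge the rule sets of the requested dialects.

    When no dialect is specified, every rule set is used, which models
    the dialect independent mode.
    """
    chosen = dialects or list(RULE_SETS)
    merged = {}
    for dialect in chosen:
        for symbol, phones in RULE_SETS[dialect].items():
            bucket = merged.setdefault(symbol, [])
            for phone in phones:
                if phone not in bucket:
                    bucket.append(phone)
    return merged


def decode(lexicon, rules):
    """Step 4: expand each entry into its linear baseforms."""
    decoded = {}
    for word, baseform in lexicon.items():
        # A plain phone has no rule and expands to itself.
        options = [rules.get(element, [element]) for element in baseform]
        decoded[word] = [" ".join(combo) for combo in product(*options)]
    return decoded


nyc_lexicon = decode(ENCODED_LEXICON, select_rules(["NYC"]))
full_lexicon = decode(ENCODED_LEXICON, select_rules())
print(nyc_lexicon["economics"])
print(len(full_lexicon["economics"]))
```

With only the NYC rules selected, "economics" decodes to two linear baseforms; with no dialect specified it decodes to all four, while the unencoded entry passes through unchanged in both cases.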
The method of the present invention is advantageous because (a) it facilitates the straightforward generation of different baseform sets for different dialects, thereby increasing recognition accuracy; (b) it eliminates the errors inherent in inputting multiple, sometimes fifty or more, variant baseforms; (c) it allows significantly easier updates and corrections because the baseform representation is more perspicuous; and (d) it requires far less input from the system designer who is establishing the baseforms.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
REFERENCES:
patent: 5694520 (1997-12-01), Lyberg
patent: 5865626 (1999-02-01), Beattie et al.
patent: 6061646 (2000-05-01), Martino et al.
patent: 6064963 (2000-05-01), Gainsboro
A. P. Breen, et al. “Designing the next generation of text-to-speech systems,” Proc. IEE Colloquium on Techniques for Speech Processing and their Applications, vol. 6, p. 1-5, 1994.*
Francis Kubala, et al. “Transcribing radio new
Bahl Lalit R.
Cohen Paul S.
F. Chau & Associates, LLP
International Business Machines - Corporation
Smits Talivaldis I.