Reexamination Certificate
1998-06-26
2002-04-23
Knepper, David D. (Department: 2645)
Data processing: speech signal processing, linguistics, language
Speech signal processing
Recognition
C704S256000, C704S242000
active
06377921
BACKGROUND OF THE INVENTION
This invention relates to speech recognition and, more particularly, to apparatus and methods for identifying mismatches between assumed pronunciations of words, e.g., from transcriptions, and actual pronunciations of words, e.g., from acoustic data.
Speech recognition systems are being used in several areas today to transcribe speech into text. The success of this technology in simplifying man-machine interaction is stimulating the use of the technology in several applications such as transcribing dictation, voicemail, home banking, directory assistance, etc. Though it is possible to design a generic speech recognition system and then use it in a variety of different applications, it is generally the case that if the system is tailored to the particular application being addressed, it is possible to obtain much better performance than the generic system.
Most speech recognition systems consist of two components: an acoustic model that models the characteristics of speech, and a language model that models the characteristics of the particular spoken language. The parameters of both these models are generally estimated from training data from the application domain of interest.
In order to train the acoustic models, it is necessary to have acoustic data along with the corresponding transcription. For training the language model, it is necessary to have the transcriptions that represent typical sentences in the selected application domain.
Hence, with the goal of optimizing performance in the selected application domain, it is often the case that much training data is collected from the domain. However, it is also often the case that only the acoustic data can be collected in this manner, and the data has to be transcribed later, possibly by a human listener. Further, where spontaneous speech is concerned, it is relatively difficult to obtain verbatim transcriptions because of the existence of several mispronunciations, inconsistencies and errors in the speech, and the human transcription error rate is fairly high. This in turn affects the estimation of the acoustic model parameters and, as is known, transcriptions with a significant number of errors often lead to poorly estimated or corrupted acoustic models.
Accordingly, it would be highly advantageous to provide apparatus and methods to identify regions of the transcriptions that have errors. Then, it would be possible to post-process these regions, either automatically or by a human or a combination thereof, in order to refine or correct the transcriptions in this region alone.
Further, in most speech recognition systems, it is generally the case that words in the vocabulary are represented as a sequence of fundamental acoustic units such as phones (referred to as the baseform of the word). Also, it is often the case that the baseform representation of a word does not correspond to the manner in which the word is actually uttered. Accordingly, it would also be highly advantageous to provide apparatus and methods to identify such mismatches in the baseform representation and actual acoustic pronunciation of words.
Further, it is often the case that in spontaneous speech, due to co-articulation effects, the concatenation of the baseform representation of a group of words may not be an appropriate model, and it may be necessary to construct a specific baseform for the co-articulated word. For example, the phrase “going to” may commonly be pronounced “gonna.” Accordingly, it would also be highly advantageous to provide apparatus and methods for such a co-articulated word to be detected and allow for a specific baseform to be made for it (e.g., a baseform for “gonna”) rather than merely concatenating the baseforms of the non-co-articulated phrase (e.g., concatenating baseforms of words “going” and “to”).
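By way of illustration only, the difference between concatenating word baseforms and using a dedicated baseform for a co-articulated phrase may be sketched as follows. The phone symbols, the lexicon entries and the function name `phones_for` are hypothetical and are not taken from the patent.

```python
# Hypothetical lexicon: each word maps to its baseform (a phone sequence).
# The phone symbols are illustrative assumptions, not the patent's inventory.
LEXICON = {
    "going": ["G", "OW", "IH", "NG"],
    "to": ["T", "UW"],
}

# Hypothetical dedicated baseforms for co-articulated word groups.
COMPOUND_BASEFORMS = {
    ("going", "to"): ["G", "AH", "N", "AH"],  # "gonna"
}


def phones_for(words, use_compounds=True):
    """Expand a word sequence into phones.

    When use_compounds is True, a two-word group with a dedicated baseform
    is expanded with that baseform; otherwise the baseforms of the
    individual words are simply concatenated.
    """
    phones = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if use_compounds and pair in COMPOUND_BASEFORMS:
            phones.extend(COMPOUND_BASEFORMS[pair])
            i += 2
        else:
            phones.extend(LEXICON[words[i]])
            i += 1
    return phones


concatenated = phones_for(["going", "to"], use_compounds=False)
coarticulated = phones_for(["going", "to"], use_compounds=True)
```

Here `concatenated` holds the six phones of "going" followed by "to", whereas `coarticulated` holds the four phones of the dedicated "gonna" baseform.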
Lastly, there may also be inconsistencies between a transcription and input acoustic data due to modeling inaccuracies in the speech recognizer. Accordingly, it would be highly advantageous to provide apparatus and methods for erroneous segments in the transcription to be identified, so that they can be corrected by other means.
SUMMARY OF THE INVENTION
The present invention provides apparatus and methods to identify mismatches between some given acoustic data and its supposedly verbatim transcription. It is to be appreciated that the transcription may be, for example, at the word level or phone level and the mismatches may arise due to, for example, inaccuracies in the word level transcription, poor baseform representation of words, background noise at the time the acoustic data was provided, or co-articulation effects in common phrases. The present invention includes starting with a transcription having errors and computing a Viterbi alignment of the acoustic data against the transcription. The words in the transcription are assumed to be expressed in terms of certain basic units or classes such as phones, syllables, words or phrases and the acoustic model is essentially composed of models for each of these different units. The process of Viterbi aligning the data against the transcription and computing probability scores serves to assign a certain probability to each instance of a unit class in the training data. Subsequently, for each class, a histogram of the scores of that class is computed from all instances of that class in the training data. Accordingly, the present invention advantageously identifies those instances of the class that correspond to the lowest scores in the histogram as “problem regions” where there is a mismatch between the acoustic data and the corresponding transcription. Subsequently, the transcription or baseform can be refined for these regions, either automatically or manually by a human listener, as will be explained. It is to be appreciated that the invention is applicable to identification of mismatches between a transcription and acoustic data associated with a training session or a real-time decoding session.
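The alignment and scoring step described above may be sketched, under simplifying assumptions, as follows. The sketch assumes that each unit of the transcription is modeled by a single state, that per-frame log-likelihoods are already available, and that the score of a unit instance is its average log-likelihood per frame. None of these choices is stated in the patent, and the name `forced_align` is hypothetical.

```python
def forced_align(frame_loglik):
    """Viterbi-align frames against a fixed left-to-right sequence of units.

    frame_loglik[t][u] is the log-likelihood of frame t under the model of
    the u-th unit of the transcription (assumed to be precomputed).
    Returns one (start, end, avg_loglik) tuple per unit, with end exclusive.
    """
    T = len(frame_loglik)
    U = len(frame_loglik[0])
    if T < U:
        raise ValueError("need at least one frame per unit")
    NEG = float("-inf")
    best = [[NEG] * U for _ in range(T)]
    advanced = [[False] * U for _ in range(T)]  # True if the path entered unit u at frame t
    best[0][0] = frame_loglik[0][0]
    for t in range(1, T):
        for u in range(U):
            stay = best[t - 1][u]
            advance = best[t - 1][u - 1] if u > 0 else NEG
            if advance > stay:
                best[t][u] = advance + frame_loglik[t][u]
                advanced[t][u] = True
            elif stay > NEG:
                best[t][u] = stay + frame_loglik[t][u]
    # Backtrace from the last unit at the last frame.
    unit_of_frame = [0] * T
    u = U - 1
    for t in range(T - 1, -1, -1):
        unit_of_frame[t] = u
        if advanced[t][u]:
            u -= 1
    # Collapse the frame labels into segments and score each segment.
    segments = []
    start = 0
    for t in range(1, T + 1):
        if t == T or unit_of_frame[t] != unit_of_frame[start]:
            seg_unit = unit_of_frame[start]
            total = sum(frame_loglik[k][seg_unit] for k in range(start, t))
            segments.append((start, t, total / (t - start)))
            start = t
    return segments


# Five frames aligned against a two-unit transcription (made-up scores).
example = [[-1, -5], [-1, -5], [-5, -1], [-5, -1], [-5, -4]]
segments = forced_align(example)
```

Each returned tuple gives the frame span assigned to one unit instance and the score attached to that instance. These per-instance scores are what would be pooled by unit class to form the distributions.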
In one aspect of the invention, a method for identifying mismatches between acoustic data and a corresponding transcription, the transcription being expressed in terms of basic units, comprises the steps of: aligning the acoustic data with the corresponding transcription; computing a probability score for each instance of a basic unit in the acoustic data with respect to the transcription; generating a distribution for each basic unit; tagging, as mismatches, instances of a basic unit corresponding to a particular range of scores in the distribution for each basic unit based on a threshold value; and correcting the mismatches.
In another aspect of the invention, computer-based apparatus for identifying mismatches between acoustic data and a corresponding transcription associated with a speech recognition engine, the transcription being expressed in terms of phonetic units, comprises: a processor, operatively coupled to the speech recognition engine, for: aligning the acoustic data with the corresponding transcription; computing a probability score for each instance of a basic unit in the acoustic data with respect to the transcription; generating a distribution for each basic unit; tagging, as mismatches, instances of a basic unit corresponding to a particular range of scores in the distribution for each basic unit based on a threshold value; and correcting the mismatches.
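The distribution and tagging steps recited above may be sketched as follows. Rather than building an explicit histogram, the sketch sorts the scores of each unit class and tags the lowest-scoring fraction, which selects the same low tail of the empirical distribution. The default threshold of 5 percent, the tuple layout and the name `tag_mismatches` are assumptions made for illustration and do not come from the patent.

```python
import math
from collections import defaultdict


def tag_mismatches(instances, fraction=0.05):
    """Tag the lowest-scoring instances of each unit class as mismatches.

    instances: iterable of (unit_label, location, score) tuples, one per
               aligned unit instance (location is any caller-defined id).
    fraction:  share of each class's instances to tag (assumed threshold).
    Returns the tagged (unit_label, location, score) tuples.
    """
    by_class = defaultdict(list)
    for label, location, score in instances:
        by_class[label].append((score, location))

    tagged = []
    for label, scored in by_class.items():
        scored.sort(key=lambda item: item[0])  # lowest scores first
        n_tag = math.floor(fraction * len(scored))
        for score, location in scored[:n_tag]:
            tagged.append((label, location, score))
    return tagged


# Ten instances of a hypothetical unit "AH", one of which scores poorly,
# and four instances of a hypothetical unit "T" with identical scores.
scores = [("AH", i, -9.0 if i == 7 else -1.0) for i in range(10)]
scores += [("T", i, -2.0) for i in range(4)]
problem_regions = tag_mismatches(scores, fraction=0.1)
```

With these made-up scores, only the poorly scoring "AH" instance at location 7 is tagged; the "T" class has too few instances for a 10 percent tail to contain any. The tagged locations are the regions that would then be passed on for automatic or manual correction.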
Bahl Lalit R.
Padmanabhan Mukund
F. Chau & Associates LLP
International Business Machines Corporation
Knepper David D.