Method and system for generating semi-literal transcripts...

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate


Details

C704S270000, C704S255000


active

06535849


TECHNICAL FIELD
This invention relates to speech recognition methods and systems. More particularly, this invention relates to computerized methods and systems for generating semi-literal transcripts that may be used as source data for acoustic and language models for a speech recognition system and for other purposes where literal transcripts could be used.
BACKGROUND
Speech recognition systems, or speech recognizers, use recorded speech as an input and generate, or attempt to generate, a transcript of the spoken words in the speech recording. The recorded speech may come in a variety of forms; one common form is a digital recording, such as a μ-law encoded 8-bit digital audio signal.
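The μ-law companding mentioned above can be illustrated with the continuous form of its expansion curve. This is a sketch only: real G.711 codecs use a segmented 8-bit integer approximation rather than this analytic formula.

```python
import math

def mulaw_expand(y, mu=255.0):
    """Map a mu-law companded value y in [-1, 1] back to a linear
    sample in [-1, 1] (continuous form of the mu-law curve)."""
    return math.copysign((math.pow(1.0 + mu, abs(y)) - 1.0) / mu, y)

def mulaw_byte_to_float(b, mu=255.0):
    """Treat an unsigned 8-bit code (0..255) as a signed companded
    value and expand it to a linear float sample."""
    y = (b - 128) / 128.0  # center and scale to roughly [-1, 1)
    return mulaw_expand(y, mu)
```

The expansion curve is steep near full scale and nearly flat near zero, which is why 8 companded bits can cover the dynamic range of telephone speech.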
Speech recognizers are commonly available. They use models of previous speech to assist in decoding a given utterance in a speech recording. One such commercial speech recognizer is the Truetalk product developed by Entropic, Inc. This speech recognizer, which runs on a computer, generally comprises an experience base and pattern recognition code that drives the speech recognizer. The experience base contains important components of the speech recognizer and may use a variety of models in speech recognition. The primary categories of models are acoustic models and language models.
The acoustic models of the speech recognizer may contain a set of models of sounds (sometimes called phonemes) and sound sequences (triphones). Each sound used in common speech may therefore be represented by a model within the acoustic models. For instance, the sounds “k,” “ae” and “t” (which together form the word “cat”) may be represented within the acoustic models. The acoustic models are used to assist in the recognition of the phonetic sequences that support the speech recognizer's selection of the most likely words of a given utterance, and the acoustic models use statistical representations to accomplish this task.
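As a toy illustration of statistical acoustic scoring, each phoneme below is modeled as a single Gaussian over a one-dimensional acoustic feature. This is not the patent's model: real recognizers use hidden Markov models over multi-dimensional feature vectors, typically with Gaussian mixtures per triphone state, and the means and variances here are invented for the example.

```python
import math

# Hypothetical single-Gaussian "acoustic models": phoneme -> (mean, variance)
# over a 1-D acoustic feature, using the "k", "ae", "t" sounds of "cat".
ACOUSTIC_MODELS = {
    "k":  (0.8, 0.05),
    "ae": (0.3, 0.04),
    "t":  (0.9, 0.06),
}

def log_likelihood(feature, phoneme):
    """Gaussian log-likelihood of an observed feature under a phoneme model."""
    mean, var = ACOUSTIC_MODELS[phoneme]
    return -0.5 * (math.log(2 * math.pi * var) + (feature - mean) ** 2 / var)

def best_phoneme(feature):
    """Pick the phoneme whose model makes the observation most likely."""
    return max(ACOUSTIC_MODELS, key=lambda p: log_likelihood(feature, p))
```

The same "compare likelihoods under competing statistical models" step is what lets the recognizer support the selection of the most likely words for a given utterance.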
The language models may aid in determining the occurrence of words by applying known patterns of occurrences of words within speech. For instance, the language model may be able to determine the words from the context or from patterns of occurrence of certain words in spoken language.
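One simple way to model patterns of occurrence of words is a bigram count, sketched below. A production language model would add smoothing and back-off for unseen word pairs; those are omitted here for brevity, and the training sentences are invented.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count word-pair occurrences, with <s>/</s> sentence boundary markers."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def next_word_probability(counts, prev, word):
    """Maximum-likelihood estimate P(word | prev); 0.0 for unseen contexts."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0
```

Given training text like "the cat sat" and "the cat ran", the model learns that "cat" reliably follows "the", which is exactly the kind of contextual evidence a recognizer uses to choose between acoustically similar words.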
The Truetalk speech recognizer contains three inter-connected modules within the experience base: a set of acoustic models, a language model, and a pronunciation dictionary. The three modules function together to recognize words in spoken speech. The pronunciation dictionary may be a set of models that is capable of combining the sounds within the acoustic models to form words. For example, the pronunciation dictionary may include models that can combine the “k,” “ae” and “t” sounds from the acoustic models to form the word “cat.” Although the speech recognizer described herein will be described with reference to the English language, the modules may be adapted to perform word recognition for other languages.
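A pronunciation dictionary can be sketched as a mapping from words to phoneme sequences, following the "cat" example above. The entries and the exact-match lookup are illustrative; a real dictionary stores multiple pronunciations per word and is searched during decoding rather than after it.

```python
# Hypothetical pronunciation dictionary: word -> phoneme sequence.
PRONUNCIATIONS = {
    "cat": ["k", "ae", "t"],
    "cab": ["k", "ae", "b"],
}

def words_matching(phoneme_sequence, dictionary=PRONUNCIATIONS):
    """Return every dictionary word whose pronunciation exactly matches
    the recognized phoneme sequence."""
    return [w for w, phones in dictionary.items() if phones == phoneme_sequence]
```

This is the bridge between the acoustic models (which score sounds) and the language model (which scores word sequences): the dictionary defines which sound sequences count as words at all.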
Commercial speech recognizers generally come with generic versions of the experience base. Some of these speech recognizers, such as the Truetalk product by Entropic, Inc., allow the user to train, modify and add to the models. The models, for instance, may be modified so that filled pause “words,” such as “um” or “ah,” are represented in the data used to train the models and so that patterns of occurrence are modeled for these “words.” A large number of words (on the order of between 2 million and 500 million) may be used to train the language model and the acoustic models. The models may be person-specific, such as for specific users with different accents or grammatical patterns, or specific to certain contexts, such as the medical field. If the models are limited by person or context, the models may require less training to determine patterns of occurrence of words in speech. The models, however, need not be person or context specific. The significant point is that the models, and in particular the acoustic models and language models, may be trained or modified so that they perform better to recognize speech for a given speaker or context.
Literal transcripts have traditionally been used to train and modify acoustic models and language models. The literal transcript and the recorded speech are submitted to software that generates an acoustic model or language model or that modifies a given acoustic model or language model for transcribed words. This software is well established and commonly used by those skilled in the art. One problem with this method of producing acoustic models or language models, however, is that a literal transcript must be generated for use in building the model. A “literal transcript” of recorded speech, as used in this specification, means a transcript that includes all spoken words or utterances in the recorded speech, including filled pause words (such as “um” and “ah”), repair instructions in dictated speech (such as “go left, no, I mean go right”), grammatical errors, and any pleasantries and asides dictated for the benefit of the human transcriptionist (such as “end of dictation; thank you,” or “new paragraph”). Such literal transcripts are generated by human transcriptionists, which is a labor intensive and expensive task, especially when the end product of a literal transcript is not the desired output in the transcription business.
The commercial transcription business produces partial transcripts as its desired output. These partial transcripts typically omit filled pause words, repairs, pleasantries and asides, and grammatical errors. A “partial transcript,” as used throughout this specification, is the output that the dictator of the speech desires, rather than a literal record of the dictated speech. It is, in other words, what the human transcriptionist generates from recorded speech, which typically involves correcting grammatical errors, repetitive speech, partial sentences, and other speech that should not be included in a commercial transcript. Unlike literal transcripts, which have no real commercial value, partial transcripts are the desired end product of the transcription business. Although partial transcripts are commonly generated in the transcription business, they omit and alter much of the spoken speech in a recording and are therefore commonly of limited value as a data source for building or modifying the models of a speech recognizer.
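The gap between literal and partial transcripts can be made concrete with a toy cleanup function. This is only an illustration of one difference (filled pause removal); real transcriptionists also apply repairs, fix grammar, and execute formatting commands such as "new paragraph", and the word lists here are invented.

```python
# Hypothetical list of filled pause words to drop from a literal transcript.
FILLED_PAUSES = {"um", "ah", "uh"}

def to_partial(literal):
    """Strip filled pause words from a literal transcript, as a crude
    approximation of one step a transcriptionist performs."""
    words = [w for w in literal.split()
             if w.lower().strip(",.") not in FILLED_PAUSES]
    return " ".join(words)
```

Running the function on "um the patient ah is stable" yields "the patient is stable": the partial transcript is shorter and cleaner, but the information about where the pauses occurred, which the acoustic models need, is gone.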
A need exists for a method and system that can use commonly available partial transcripts of recorded speech to develop or modify the models of a speech recognizer.
SUMMARY
One embodiment of the invention is a method for generating a semi-literal transcript from a partial transcript of recorded speech. In this embodiment, the method includes augmenting the partial transcript with words from one of a filled pause model and a background model to build an augmented probabilistic finite state model for the partial transcript, inputting the recorded speech and the augmented probabilistic finite state model to a speech recognition system, and generating a hypothesized output for the recorded speech using the speech recognition system, whereby the hypothesized output may be used as the semi-literal transcript. In another embodiment, the method may further include integrating the hypothesized output with the partial transcript to generate the semi-literal transcript of the recorded speech.
In another embodiment of a method for generating a semi-literal transcript from a partial transcript of recorded speech, the invention comprises augmenting the partial transcript with words from a filled pause model and a background model to build an augmented probabilistic finite state model for the partial transcript, inputting the recorded speech and the augmented probabilistic finite state model to a speech recognition system, generating a hypothesized output for the recorded speech using the speech recognition system, and integrating the hypothesized output with the partial transcript to generate a semi-literal transcript of the recorded speech.
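The augmentation step described above can be sketched as follows. This is not the patent's probabilistic finite state model: the arc representation, the fixed pause vocabulary, and the probability values are all illustrative assumptions. The idea shown is only that between consecutive words of the partial transcript, the model offers optional arcs for filled-pause (or background) words, so the recognizer can align the recorded speech against a transcript that admits the speech the transcriptionist removed.

```python
def build_augmented_fsm(partial_words, filled_pauses=("um", "ah"),
                        p_pause=0.1):
    """Build a list of (state, word, next_state, probability) arcs for a
    toy augmented finite state model: at each state, a self-loop may emit
    a filled-pause word, or the next transcript word advances the state."""
    arcs = []
    state = 0
    for word in partial_words:
        # Optional self-loops emitting a filled pause before the next word.
        for fp in filled_pauses:
            arcs.append((state, fp, state, p_pause / len(filled_pauses)))
        # Arc consuming the next word of the partial transcript.
        arcs.append((state, word, state + 1, 1.0 - p_pause))
        state += 1
    return arcs
```

Decoding against such a model constrains the recognizer to roughly follow the partial transcript while still hypothesizing the filled pauses and background words the transcript omits, which is what makes the output usable as a semi-literal transcript.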
In another embodiment...
