Classification: Data processing: speech signal processing, linguistics, language – Speech signal processing – Recognition
Other classes: C704S251000, C704S275000
Type: Reexamination Certificate (active)
Filed: 1998-06-15
Issued: 2003-07-29
Examiner: Banks-Harold, Marsha D. (Department: 2654)
Patent number: 06601027
TECHNICAL FIELD
The invention relates to position manipulation in speech recognition.
BACKGROUND
A speech recognition system analyzes a user's speech to determine what the user said. Most speech recognition systems are frame-based. In a frame-based system, a processor divides a signal descriptive of the speech to be recognized into a series of digital frames, each of which corresponds to a small time increment of the speech.
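As a concrete sketch of this framing step (in Python, with an assumed 16 kHz sample rate and 10 ms frames; both parameters are illustrative, not taken from the patent):

```python
def split_into_frames(samples, rate=16000, frame_ms=10):
    """Divide a digitized speech signal into fixed-size frames,
    each corresponding to a small time increment of the speech."""
    frame_len = rate * frame_ms // 1000  # samples per frame (160 here)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples), frame_len)]

# 30 ms of silence at 16 kHz divides into three 10 ms frames
frames = split_into_frames([0.0] * 480)
```

A real front end would then derive acoustic parameters (e.g., spectral features) from each frame; this sketch shows only the division itself.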
A continuous speech recognition system can recognize spoken words or phrases regardless of whether the user pauses between them. By contrast, a discrete speech recognition system recognizes discrete words or phrases and requires the user to pause briefly after each one. Continuous speech recognition systems typically have a higher incidence of recognition errors than discrete systems because recognizing continuous speech is more complex. A more detailed description of continuous speech recognition is provided in U.S. Pat. No. 5,202,952, entitled “LARGE-VOCABULARY CONTINUOUS SPEECH PREFILTERING AND PROCESSING SYSTEM,” which is incorporated by reference.
In general, the processor of a continuous speech recognition system analyzes “utterances” of speech. An utterance includes a variable number of frames and may correspond to a period of speech followed by a pause of at least a predetermined duration.
The processor determines what the user said by finding acoustic models that best match the digital frames of an utterance, and identifying text that corresponds to those acoustic models. An acoustic model may correspond to a word, phrase or command from a vocabulary. An acoustic model also may represent a sound, or phoneme, that corresponds to a portion of a word. Collectively, the constituent phonemes for a word represent the phonetic spelling of the word. Acoustic models also may represent silence and various types of environmental noise.
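A toy pronunciation lexicon illustrates the idea of a phonetic spelling; the phoneme symbols below are ARPAbet-style, but the entries are invented for illustration and are not drawn from the patent:

```python
# Hypothetical lexicon mapping vocabulary words to phoneme sequences.
LEXICON = {
    "speech": ("S", "P", "IY", "CH"),
    "word": ("W", "ER", "D"),
}

def phonetic_spelling(word):
    """Return the constituent phonemes (phonetic spelling) of a word,
    or None if the word is not in the vocabulary."""
    return LEXICON.get(word.lower())
```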
The words or phrases corresponding to the best matching acoustic models are referred to as recognition candidates. The processor may produce a single recognition candidate (i.e., a single sequence of words or phrases) for an utterance, or may produce a list of recognition candidates.
Correction mechanisms for previous discrete speech recognition systems displayed a list of choices for each recognized word and permitted a user to correct a misrecognition by selecting a word from the list or typing the correct word. For example, DragonDictate® for Windows®, available from Dragon Systems, Inc. of Newton, Mass., displayed a list of numbered recognition candidates (“a choice list”) for each word spoken by the user, and inserted the best-scoring recognition candidate into the text being dictated by the user. If the best-scoring recognition candidate was incorrect, the user could select a recognition candidate from the choice list by saying “choose-N”, where “N” was the number associated with the correct candidate. If the correct word was not on the choice list, the user could refine the list, either by typing in the first few letters of the correct word, or by speaking words (e.g., “alpha”, “bravo”) associated with the first few letters. The user also could discard the incorrect recognition result by saying “scratch that”.
Dictating a new word implied acceptance of the previous recognition. If the user noticed a recognition error after dictating additional words, the user could say “Oops”, which would bring up a numbered list of previously-recognized words. The user could then choose a previously-recognized word by saying “word-N”, where “N” was the number associated with the word. The system would respond by displaying a choice list associated with the selected word and permitting the user to correct the word as described above.
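The correction commands described above can be modeled with a small dispatcher; the function and argument names are hypothetical, and only the commands named in the text are handled:

```python
def apply_correction(words, choice_list, command):
    """Apply a discrete-system correction command to dictated text.
    `words` is the dictated text so far; `choice_list` holds the numbered
    recognition candidates for the most recently recognized word."""
    if command == "scratch that":
        return words[:-1]                    # discard the incorrect result
    if command.startswith("choose-"):
        n = int(command.split("-", 1)[1])    # "choose-2" -> candidate 2
        return words[:-1] + [choice_list[n - 1]]
    return words                             # not a correction command

# e.g. replacing a misrecognized word with the second choice:
corrected = apply_correction(["hello", "whirled"],
                             ["whirled", "world", "word"], "choose-2")
```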
SUMMARY
In one general aspect, an action position in computer-implemented speech recognition is manipulated in response to received data representing a spoken command. The command includes a command identifier and a designation of at least one previously-spoken word. Speech recognition is performed on the data to identify the command identifier and the designation. Thereafter, an action position is established relative to the previously-spoken word based on the command identifier.
Implementations may include one or more of the following features. The designation may include a previously-spoken word or words, or may include a shorthand identifier for a previously-spoken selection or utterance (e.g., “that”).
The command identifier may indicate that the action position is to be before (e.g., “insert before”) or after (e.g., “insert after”) the previously-spoken word, words, or utterance. When this is the case, the action position may be established immediately prior to, or immediately following, the previously-spoken word, words, or utterance.
The designation may include one or more previously-spoken words and one or more new words. In this case, any words following the previously-spoken words included in the command may be replaced by the new words included in the command. The action position is then established after the new words. This command may be implemented, for example, as a “resume with” command in which the words “resume with” are followed by one or more previously-recognized words and one or more new words.
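A minimal sketch of this behavior, assuming the words spoken after “resume with” are already available as a list; the matching heuristic here (longest prefix of the command tail, matched nearest the end of the dictation) is an assumption made for illustration, not the patent's algorithm:

```python
def resume_with(recognized, tail):
    """Find previously-recognized words at the start of `tail`, discard
    everything dictated after them, append the remaining (new) words,
    and return the new text plus the action position."""
    for k in range(len(tail), 0, -1):
        anchor = tail[:k]
        # scan from the end: corrections usually target recent dictation
        for i in range(len(recognized) - k, -1, -1):
            if recognized[i:i + k] == anchor:
                new_text = recognized[:i + k] + tail[k:]
                return new_text, len(new_text)  # action position after new words
    # no overlap found: treat the whole tail as new dictation
    new_text = recognized + tail
    return new_text, len(new_text)
```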
The “resume with” command does not rely on the presentation of information on the display. For that reason, the command is particularly useful when the user records speech using a portable recording device, such as an analog or digital recorder, and subsequently transfers the recorded speech to the speech recognition system for processing. In that context, the “Resume With” command provides the user with a simple and efficient way of redirecting the dictation and eliminating erroneously-spoken words.
The data representing the command may be generated by recording the command using a recording device physically separate from a computer implementing the speech recognition. When the recording device is a digital recording device, the data may be in the form of a file generated by the digital recording device. The data also may be in the form of signals generated by playing back the spoken command using the recording device, such as when an analog recording device is used.
In another general aspect, a block of text is selected in computer-implemented speech recognition in response to data representing a spoken selection command. The command includes a command identifier and a text block identifier identifying a block of previously-recognized text. At least one word included in the block of text is not included in the text block identifier. Speech recognition is performed on the data to identify the command identifier and the text block identifier. Thereafter, the block of text corresponding to the text block identifier is selected.
Implementations may include one or more of the following features. The text block identifier may include at least a first previously-recognized word of the block of text and at least a last previously-recognized word of the block of text. For example, the command identifier may be “select” and the text block identifier may include the first previously-recognized word of the block of text, “through”, and the last previously-recognized word of the block of text (i.e., “select X through Y”). Alternatively, the text block identifier may be a shorthand notation (e.g., “that”) for a previously-spoken selection or utterance.
Speech recognition may be performed using a constraint grammar. The constraint grammar may permit the block of text to start with any word in a set of previously-recognized words and to end with any word in the set of previously-recognized words. The set of previously-recognized words may include previously-recognized words displayed on a display device when the selection command is spoken.
Performing speech recognition may include generating multiple candidates for the text block identifier, and eliminating candidates for which the block of text starts with a previously-recognized word spoken after a previously-recognized word with which the block of text ends.
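The elimination step might be sketched as follows for a “select X through Y” command; preferring the most recently spoken valid block is an illustrative tie-breaking policy, not necessarily the patent's:

```python
def select_through(words, first, last):
    """Select a block of previously-recognized text given its first and
    last words. Candidate (start, end) pairs in which the start word was
    spoken after the end word are eliminated."""
    starts = [i for i, w in enumerate(words) if w == first]
    ends = [i for i, w in enumerate(words) if w == last]
    candidates = [(s, e) for s in starts for e in ends if s <= e]
    if not candidates:
        return None
    s, e = max(candidates)  # prefer the most recently spoken block
    return words[s:e + 1]
```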
Inventors: Dubach, Joev; Gold, Allan; Gould, Joel M.; Parmenter, David W.; Wright, Barton D.
Examiners: Banks-Harold, Marsha D.; Lerner, Martin
Assignee: ScanSoft, Inc.